Chapter 3: Lexical Analysis

A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of the tokens of the source language, and then to hand-translate the diagram into a program for finding tokens. Pattern-directed programming is used in many areas other than compilers, such as:
- Query languages
- Information-retrieval systems

The Role of the Lexical Analyzer

Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. It also removes (skips over) white space, and it keeps track of the number of newline characters so that a line number can be associated with an error message when applicable.
Some lexical analyzers are divided into a cascade of two phases:
- Scanning: the simple tasks (e.g., eliminating white space).
- Lexical analysis proper: the more complex tasks.

Why separate the analysis phase of compiling into lexical analysis and parsing?
- Simplicity
- Efficiency
- Portability
Tokens, Patterns, Lexemes

- Token: a set of strings (e.g., id, num, opr).
- Lexeme: a sequence of characters in the source program that is matched by the pattern for a token.
- Pattern: a rule defining a token. For example:
  - opr: +, *, /, -, >, <, <=, >=, <>, =
  - id: a letter followed by letters or digits

Lexical Errors

When the input matches no pattern, an error has occurred; the simplest strategy is to delete successive characters until a well-formed token is found.
Error Recovery

Other possible error-recovery actions:
- Deleting an extraneous character.
- Inserting a missing character.
- Replacing an incorrect character by a correct one.
- Transposing two adjacent characters.

Specification of Tokens

Strings and languages:
- Alphabet: a finite set of symbols.
- String: a finite sequence of symbols drawn from an alphabet.
- Language: a set of strings.
Regular expressions are an important notation for specifying token patterns.
Regular Expressions

For example, let the alphabet be Σ = {a, b}:
- The regular expression a | b denotes the set {a, b}.
- (a | b)(a | b) denotes {aa, ab, ba, bb}, the same set as aa | ab | ba | bb.
- a* denotes the set of all strings of zero or more a's.
- (a | b)* denotes the set of all strings of zero or more instances of a or b.

A regular definition of id:

  id     → letter (letter | digit)*
  letter → A | ... | Z | a | ... | z
  digit  → 0 | ... | 9
Recognition of Tokens

Identifier:

  letter → A | ... | Z | a | ... | z
  digit  → 0 | ... | 9
  id     → letter (letter | digit)*

Unsigned numbers in Pascal are strings such as 5280, 39.37, 6.336E4, 1.894E-4:

  digit             → 0 | 1 | ... | 9
  digits            → digit digit*
  optional-fraction → . digits | ε
  optional-exponent → E (+ | - | ε) digits | ε
  num               → digits optional-fraction optional-exponent
With the r? operator the same definition becomes:

  digit             → 0 | ... | 9
  digits            → digit digit*
  optional-fraction → (. digits)?
  optional-exponent → (E (+ | -)? digits)?
  num               → digits optional-fraction optional-exponent

where r? is the same as r | ε.

We assume that lexemes are separated by white space, consisting of a non-null sequence of blanks, tabs, and newlines:

  delim → blank | tab | newline
  ws    → delim+
Transition diagram for >=
[diagram not reproduced]

Transition diagram for RelOp
[diagram not reproduced]
Identifiers

Remember that we will treat keywords as identifiers, rather than encode the keywords into transition diagrams.

The return statement of the accepting state uses:
- gettoken: looks in the symbol table; if the lexeme is a keyword, the token KW is returned, otherwise the token ID is returned.
- install_id: if gettoken returns KW, it returns 0; if the lexeme is found in the symbol table, it returns a pointer to the existing entry; if the lexeme is not found, it is installed as a variable and a pointer to the new entry is returned.
Numbers
[transition diagram not reproduced]
How to Handle Errors?

- Case 1: If an unrecognized character is read, none of the case options fires; at the end of the cases there is an error message.
- Case 2: If an unexpected character such as # appears in the middle of a token, issue an ERROR message and start looking for a new token.
How to Handle Comments:
[diagram not reproduced]

Lexical_analyzer()
{
    while (!EOF(input)) {
        switch (state) {
        case 0:
            c = nextchar();                 /* c is the lookahead character */
            if (c == blank || c == tab || c == newline)
                ;                           /* skip white space, stay in state 0 */
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = 9;
            break;
        case 1:
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2:
            TokenVal = LE; TokenType = Relop;
            break;
        case 3:                             /* lexeme is <> */
            TokenVal = NE; TokenType = Relop;
            break;
        case 4:
            retract(1);
            TokenVal = LT; TokenType = Relop;
            break;
        case 5:
            TokenVal = EQ; TokenType = Relop;
            break;
        case 6:
            c = nextchar();
            if (c == '=') state = 7;
            else state = 8;
            break;
        case 7:
            TokenVal = GE; TokenType = Relop;
            break;
        case 8:
            retract(1);
            TokenVal = GT; TokenType = Relop;
            break;
        case 9:
            if (isletter(c)) state = 10;
            else state = 12;
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        case 11:
            retract(1);
            TokenVal = install_id();
            TokenType = gettoken();
            break;
        case 12:                            /* a digit starts a number */
            if (isdigit(c)) state = 13;
            else state = 22;
            break;
        case 13:
            c = nextchar();
            if (isdigit(c)) state = 13;
            else if (c == '.') state = 14;
            else if (c == 'E') state = 16;
            else state = 20;
            break;
        case 14:
            c = nextchar();
            if (isdigit(c)) state = 15;
            else ERROR;
            break;
        case 15:
            c = nextchar();
            if (isdigit(c)) state = 15;
            else if (c == 'E') state = 16;
            else state = 21;
            break;
        case 16:
            c = nextchar();
            if (c == '+' || c == '-') state = 17;
            else if (isdigit(c)) state = 18;
            else ERROR;
            break;
        case 17:
            c = nextchar();
            if (isdigit(c)) state = 18;
            else ERROR;
            break;
        case 18:
            c = nextchar();
            if (isdigit(c)) state = 18;
            else state = 19;
            break;
        case 19:
        case 20:
        case 21:
            retract(1);
            TokenVal = lexeme;
            TokenType = NUM;
            break;
        case 22:
            if (c == '+') state = 23;
            else if (c == '-') state = 24;
            else if (c == '*') state = 25;   /* Mul */
            else if (c == '/') state = 26;   /* Div */
            else state = 27;
            break;
        case 23:
            TokenVal = Add; TokenType = opr;
            break;
        case 24:
            TokenVal = Sub; TokenType = opr;
            break;
        case 25:
            TokenVal = Mul; TokenType = opr;
            break;
        case 26:
            TokenVal = Div; TokenType = opr;
            break;
        case 27:
            if (c == ';') state = 28;
            else if (c == ',') state = 28;
            break;
        case 28:
            TokenVal = c;                    /* the punctuation character itself */
            TokenType = pun;
            break;
        }
    }
}