3) Lexical Analysis

Input: the program to be compiled.
Output: a stream of (token, value) pairs.

Problem: read in characters and group them into tokens (words).
Produce a program listing. Do it efficiently: analysis of compiler
performance shows that most of the execution time is spent in the
lexical analysis phase.

37 9/1/01
Rationale:

1) Modular design, allowing the compiler to be partitioned into pieces
   that can be developed independently.
2) It is more efficient for the parser to deal with words, not
   characters. Incorrect words are never seen by the parser.
3) Isolates character set dependencies: ASCII vs. EBCDIC.
4) Isolates the representation of symbols:
       <>      instead of  .ne.  or  !=
       { ... } instead of  begin ... end
What is a token?

A token is a placeholder for a logical entity in a programming
language. Some tokens include: keywords, constants, operators,
punctuation, and identifiers. White space and comments are not tokens.
Example of tokenizing

    if( price + gst - rebate <= 10.00 )
        gift = false;

    Token    Token #   Value    Comment
    if       10                 keyword
    (        20                 left parenthesis
    price    50        price    identifier
    +        1         +        add operator
    gst      50        gst      identifier
    -        1         -        add operator
    rebate   50        rebate   identifier
    <=       2         <=       relational operator
    10.00    51        10.00    float constant
    )        21                 right parenthesis
    gift     50        gift     identifier
    =        3                  assign operator
    false    50        false    identifier
    ;        4                  separator
Simple tokenizer

There is an obvious way of recognizing tokens. Consider recognizing
the tokens end, else and identifiers:

    c = getchar();
    if( c == 'e' ) {
        c = getchar();
        if( c == 'n' ) {
            c = getchar();
            if( c == 'd' ) {
                next = getchar();
                if( !isletter( next ) && !isdigit( next ) )
                    return( KEYWORD_END );
                else {
                    /* Read to end of identifier. */
                    return( IDENTIFIER );
                }
            } else {
                /* Read to end of identifier. */
                return( IDENTIFIER );
            }
        } else if( c == 'l' ) {
            /* Look for else keyword or identifier. */
        } else {
            /* Look for other keywords or identifiers. */
        }
    }
This form of coding is easy to do, but is tedious! We can make it more
modular and easier to construct, at the cost of some efficiency:

    token GetToken()
    {
        SkipWhiteSpace();
        c = getchar();
        if( isletter( c ) )
            return( ScanForIdentifier() );
        if( isdigit( c ) )
            return( ScanForConstant() );
        switch( c ) {
            case '(': return( LEFT_PAREN );
            case ')': return( RIGHT_PAREN );
            case '+': return( ScanForAddorIncrement() );
            case '-': return( ScanForSuborDecrement() );
            case '=': return( ScanForEqualsorAssign() );
            case '/': return( ScanForCommentorDivide() );
            ...
            default:  return( ERROR );
        }
    }
We would like to automate this process, having a tool that builds a
fast, compact lexical analyzer for us automatically. It turns out that
most tokens can be easily defined by a regular grammar: the user
defines the tokens in a form equivalent to regular grammars, and the
system converts the grammar into code. A variety of tools exist to do
this, all similar in their approach to automating lexical analysis.
Regular expressions

Regular grammars can be expressed in several other forms. One popular
form is regular expressions. Three operations:

    concatenation   a then b       =  a b
    alternation     a or b         =  a | b
    repetition      a ... a        =  a*   (zero or more)
                    a ( a ... a )  =  a+   (one or more)

Note that regular expressions are equivalent to regular grammars: all
regular expressions can be expressed as a regular grammar, and all
regular grammars can be converted to an equivalent regular expression.
It is easy to show the equivalence...
Examples: let's use the following 2 macros to simplify our solutions:

    LETTER = ( a | b | ... | z | A | B | ... | Z )
    DIGIT  = ( 0 | 1 | ... | 9 )

1) An identifier must begin with a letter and can be followed by an
   arbitrary number of letters and digits.

   Regular grammar (<> denotes the empty string):

       ID      : LETTER ID_REST
       ID_REST : LETTER ID_REST | DIGIT ID_REST | <>

   Regular expression:

       ID : LETTER ( LETTER | DIGIT )*

   [Syntax diagram omitted: a LETTER box followed by a loop over
   LETTER and DIGIT boxes.]
2) A floating point number is one or more digits, followed by a
   decimal point, followed by one or more digits.

   Regular grammar:

       FLOAT  : DIGIT FLOAT | . DIGITS
       DIGITS : DIGIT DIGITS | DIGIT

   Not correct, since it allows there to be no digits to the left of
   the decimal point.

   Regular expression:

       FLOAT : ( DIGIT+ ) '.' ( DIGIT+ )

   [Syntax diagram omitted: a loop of DIGITs, the decimal point, then
   another loop of DIGITs.]
These rules allow infinite identifiers and infinitely precise numbers.
In the real world, there have to be restrictions:

Identifiers: some programming languages impose a limit on the length
of an identifier. Fortran, for example, only considers the first 6
characters of an identifier's name. C used to recognize only the
first 8 characters (caseless) for external names. The advantage?
Simplicity of saving names in the symbol table. The disadvantage?
Only to the user.

Numbers: machines have finite precision, so a limit must be placed on
the number of digits. Some compilers generate error messages if you
use a number that is too large/small/precise. Others do not flag an
error and give you a questionable alternative.

The lexical rules must be supplemented by additional
language/hardware constraints.
UNIX and regular expressions

Regular expressions are an integral part of the UNIX tool set:

    editors (ed, ex, vi)
    sed
    awk
    grep / fgrep / egrep
    specifying file names to shells (sh, csh, tcsh)

For example:

    egrep "(John|Jonathan).*Schaeffer" *.c

where | means alternation, () is used for grouping, . matches any
character, and * causes the previous character to match an arbitrary
number of times.
Exercise 1

A real number consists of 2 parts:

1) The integer part, consisting of one or more digits. A number may
   not begin with a zero, unless the integer is just zero.
2) The decimal part, consisting of a decimal point followed by one or
   more digits.

Construct a regular expression for real numbers.

Solution:
Finite automata

Yet another form that is equivalent to a regular grammar is a finite
automaton. Draw a diagram where terminal symbols are transitions and
non-terminals are nodes. For example:

    S : a S | b S | a A
    A : a C
    C : a C | b C | b B
    B : b D | b
    D : a D | b D | a | b

[Diagram omitted: states S, A, C, B, D plus a final state F. S loops
on a and b and moves to A on a; A moves to C on a; C loops on a and b
and moves to B on b; B moves to D or F on b; D loops on a and b and
moves to F on a or b.]

Here we have added an F (final) state to acknowledge when we have
reached a point where we know that the input is legal.
This diagram is equivalent to the regular expression:

    ( a | b )* a a ( a | b )* b b ( a | b )*

i.e., any string containing two a's followed by two b's.

To determine if the input is accepted, move from state to state,
guided by the input characters. However, this is a non-deterministic
finite automaton: in state S, on an a, do you stay in state S or go
to state A? In a deterministic finite automaton, each state has only
one transition for each input character.

It turns out that regular grammars, regular expressions,
non-deterministic finite automata (NDFA), and deterministic finite
automata (DFA) are all equivalent.
Converting an NDFA to a DFA

The NDFA's transitions:

    State    a        b
    S        S, A     S
    A        C        error
    B        error    D, F
    C        C        B, C
    D        D, F     D, F

In state S, on input a, do you go to state S or A? Don't make up your
mind just yet; postpone the decision by going to a new state SA:

    State S on input a, go to state SA
    State S on input b, go to state S

In this new SA state, on input a, where do you go? If in state S, we
would go to S or A; if in state A, on an a we would go to state C.
Create a new state SAC which reflects all 3 possibilities:

    State SA on input a, go to state SAC
    State SA on input b, go to state S
Continuing from the NDFA table:

    State    a        b
    S        S, A     S
    A        C        error
    B        error    D, F
    C        C        B, C
    D        D, F     D, F

the construction yields the DFA:

    State    a        b
    S        SA       S
    SA       SAC      S
    SAC      SAC      SBC
    SBC      SAC      SBCDF (= F)
    F        F        F

[Diagram omitted: S loops on b and moves to SA on a; SA moves to SAC
on a and back to S on b; SAC loops on a and moves to SBC on b; SBC
moves back to SAC on a and to F on b; F loops on a and b.]
Once we have a DFA, the code is easy:

    S:   c = getchar();
         if( c == 'a' ) go to SA;
         if( c == 'b' ) go to S;
         error();

    SA:  c = getchar();
         if( c == 'a' ) go to SAC;
         if( c == 'b' ) go to S;
         error();

    SAC: c = getchar();
         if( c == 'a' ) go to SAC;
         if( c == 'b' ) go to SBC;
         error();
Or we could build a table-driven lexical analyzer:

    token LexicalDriver( LexTable )
    {
        state = startstate;
        for( ; ; ) {
            c = NextChar();
            state = LexTable[ state, c ];
            if( state != error && state != finalstate ) {
                AddToToken( c );
                AdvanceInput();
            } else
                break;
        }
        if( state != finalstate )
            return( ERROR );
        else
            return( Token[ finalstate ] );
    }
Lexical analyzer generators

How does a lexical analyzer generator work?

1) Get input from the user, who defines the tokens in a form that is
   equivalent to regular grammars (usually regular expressions or
   syntax diagrams).
2) Turn the input into a non-deterministic finite automaton.
3) Convert the non-deterministic finite automaton into a
   deterministic finite automaton.
4) Generate code to recognize the deterministic finite automaton.
Exercise 2

Given the grammar:

    S : a S | a A | b S | b B
    A : b B | a C
    B : a C | b
    C : b S | b B

Draw the non-deterministic finite automaton represented by this
grammar.
Construct the deterministic finite automaton:

    State    a        b

Draw a diagram of the deterministic finite automaton:
Output listing and lexical errors

A compiler must produce a listing of the program being compiled,
augmented with informative error messages inserted near the locations
of the errors. The usual technique for producing a listing is to have
the lexical analyzer print the text as it is tokenizing. A
complication is that errors should not be printed as they occur,
since they would appear in the middle of lines. Instead, the errors
should be queued and only output once a new-line is reached.

Once a lexical error occurs, the lexical analyzer must recover from
it and continue to tokenize the input. There are two simple
approaches to lexical error recovery:

1) Ignore all characters read as part of the erroneous token and
   start a new token.
2) Delete the first character read of the erroneous token and start
   re-reading the input after the deleted character. This has the
   extra complication that input has to be read and re-read.

One error that requires special handling is a runaway string. Be
careful not to propagate error messages!
Lex - a lexical analyzer

Source code:   look at the file ex1.l
Lexing:        lex ex1.l
Lex output:    look at the file lex.yy.c -- does any of it make sense?
Compilation:   make
Execution:     ex1

Modify the code: can you modify the rules so that constants cannot
have a leading 0? What about arbitrarily long identifier names?
Comments? Floating point numbers?