Lexical Analysis Phase (Edited by Himanshu Mittal)
Lexical Analyzer
The main task of lexical analysis is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source file. Other tasks: stripping out comments and whitespace, and correlating error messages with the source program.
Role of Lexical Analyzer
The parser issues a getNextToken command, which causes the lexical analyzer to read input characters until it can identify the next lexeme; it then produces that lexeme's token and returns it to the parser. In this way the stream of tokens is sent to the parser for syntax analysis.
Patterns
Pattern: the definition used for recognizing the lexemes of a token. Regular expressions are used for defining/specifying patterns, and an NFA/DFA is used to implement the regular expressions.
Lexeme
Lexeme: a sequence of characters (a substring) of the input, identified by a pattern (regular expression) as a token. E.g., in the C statement int a = 0, the lexemes are int, a, =, 0.
Token Class
Token class: a category to which a lexeme can belong. Some common token classes are keyword, identifier, digit, operator, and literal. E.g., in the C statement int a = 0: int belongs to the keyword class, a to the identifier class, = to the operator class, and 0 to the digit class.
Another Example
In the C statement char *v = "hello"; //v is a variable: char belongs to the keyword class, v to the identifier class, = to the operator class, "hello" to the literal class, and //v is a variable to the comment class.
Token
Token: the symbol used to represent a lexeme. The representation can vary; generally, a token is represented as a pair of token class and lexeme: <token_class, lexeme>. E.g., the token for int is <keyword, int>. Note: patterns for tokens are specified through regular expressions; recognition of tokens is done through finite automata (NFA/DFA).
Lex Tool
A tool for constructing lexical analyzers from a special-purpose notation based on regular expressions. It is widely used to specify lexical analyzers for a variety of languages, and is freely available on Unix systems.
Lex Tool
A Lex program is a file with extension .l that contains regular expressions, together with the actions to be taken when each expression is matched. The Lex compiler produces an output file, usually called lex.yy.c, containing C code that defines a procedure yylex(): a table-driven implementation of a DFA corresponding to the regular expressions in the Lex file, operating like a getToken procedure. The lex.yy.c file is then compiled with a C compiler and linked with a main program to get a running program.
Lex Process
Create a file, named filename.l, that contains the specifications/regular expressions. The Lex compiler processes filename.l and produces the file lex.yy.c. The C compiler turns lex.yy.c into a.out. Steps to execute a Lex file on a Unix terminal:
lex filename.l
gcc lex.yy.c -ll    (-ll means link with the lex library; use -lfl if using flex)
./a.out
Note: lex.yy.c contains a function yylex(), which does the actual lexical analysis.
Creating a Lexical Analyzer with Lex
Lex source program → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens
Lex File Format
<Definitions>           #includes, #defines, regular-expression names
%%
<Rules>                 pattern/action pairs:  {pattern1} {action1}
                                               {pattern2} {action2}
%%
<Supplementary code>    additional code (not always needed)
Eg: Simple Lex Program
A small Lex program that prints everything entered as input back out as output. File name: scan.l
%%
.    ECHO;
\n   ECHO;
%%
main() { yylex(); }
We get this behaviour by default in Lex! This form reads from stdin; to terminate, type Ctrl-D.
Put this code in a file called scan.l. Run lex: lex scan.l. Compile: gcc lex.yy.c -ll. Run by typing ./a.out, or a.out < somefile.txt.
Another Example
%{
#include <stdio.h>
%}
digit  [0-9]+
letter [a-zA-Z]+
id     {letter}({letter}|{digit})*
%%
{id}    { printf("Found identifier %s", yytext); }
{digit} { printf("Found digit %s", yytext); }
%%
The %{ ... %} block shows how to include C code. yytext is an internal variable containing the text of the matched word.
Eg: To count the variables in an input string
%{
#include <stdio.h>
int count = 0;
%}
digit  [0-9]+
letter [a-zA-Z]+
id     {letter}({letter}|{digit})*
%%
{id}  { count++; }
%%
int main() {
    yylex();
    printf("The no. of variables in the string: %d", count);
    return 0;
}
Main Points
Text that is not matched is echoed as read; thus there is an implied ECHO. If you don't specify a main(), you get one for free! Lex patterns match each stretch of input only once. Lex executes the action for the longest possible match on the current input. If two patterns match the same length, Lex executes the action of the pattern with higher priority (i.e., the one listed first).
Example:
AAA { printf("<Found 3 A's>"); }
AA  { printf("<Found 2 A's>"); }
Given input AAAAAAAA, this will print: <Found 3 A's><Found 3 A's><Found 2 A's>
Scanning continues unless a value is returned!
Lex Predefined Variables
yytext → a string containing the matched lexeme
yyleng → the length of the matched lexeme
yyin → the input stream pointer (the default input of the default main() is stdin)
yyout → the output stream pointer (the default output of the default main() is stdout)
E.g., ./a.out < inputfile > outfile
E.g.:
[a-z]+      printf("%s", yytext);
[a-z]+      ECHO;
[a-zA-Z]+   { words++; chars += yyleng; }
PLLab, NTHU, CS2403 Programming Languages
Lex Library Routines
yylex() — the default main() contains a call to yylex()
yymore() — append the next matched text to the current yytext
yyless(n) — retain only the first n characters of yytext
yywrap() — called whenever Lex reaches end-of-file; the default yywrap() always returns 1
Pattern Matching Primitives
Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line; complement inside a character class
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
[ab]            a or b
a{3}            exactly 3 instances of a
"a+b"           literal a+b (C escapes still work)
Review of Lex Predefined Variables
Name                  Function
char *yytext          pointer to the matched string
int yyleng            length of the matched string
FILE *yyin            input stream pointer
FILE *yyout           output stream pointer
int yylex(void)       call to invoke the lexer; returns a token
char *yymore(void)    append the next match to yytext
int yyless(int n)     retain the first n characters in yytext
int yywrap(void)      wrap-up; return 1 if done, 0 if not done
ECHO                  write the matched string
REJECT                go to the next alternative rule
INITIAL               initial start condition
BEGIN                 switch start condition