Chapter 3 -- Scanner (Lexical Analyzer)
  Job: Translate the input character stream into a token stream (terminals)
    Most programs with structured input have to deal with this problem
    Need a precise definition of tokens
      Strings: xx x, "yy\"y"
      Reals: 0.3 vs. .3
      Others: 1..10
  Regular Expressions -- simple patterns
      (a|b|c)(a|b|c|_)*
      (0-9)+
  Deterministic Finite Automata -- Recognizers
  Scanner Generators
    Input: Regular Expressions
    Output: DFA in a program
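To make the "DFA as a recognizer" idea concrete, here is a minimal hand-written sketch in C of a DFA for the identifier pattern (a|b|c)(a|b|c|_)* from the slide. The names `step` and `matches_id` and the three-state layout are illustrative assumptions, not output of lex or flex; a generated scanner would normally encode the same transitions in tables.

```c
#include <stdbool.h>

/* States of a DFA for the pattern (a|b|c)(a|b|c|_)* */
enum state { START, IN_ID, DEAD };

/* One transition of the DFA on input character c. */
static enum state step(enum state s, char c)
{
    bool letter = (c == 'a' || c == 'b' || c == 'c');

    switch (s) {
    case START: return letter ? IN_ID : DEAD;
    case IN_ID: return (letter || c == '_') ? IN_ID : DEAD;
    default:    return DEAD;   /* no way out of the dead state */
    }
}

/* Returns true if the whole string s matches (a|b|c)(a|b|c|_)* */
bool matches_id(const char *s)
{
    enum state st = START;

    for (; *s != '\0'; s++)
        st = step(st, *s);
    return st == IN_ID;        /* IN_ID is the only accepting state */
}
```

With this sketch, `matches_id("ab_c")` is accepted, while `matches_id("_a")` and the empty string are rejected because the first character must be a letter.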
LEX & FLEX (One of many)
  Input: description file
    Regular expressions, each with an associated "action"
  Lex: standard AT&T original program
  Flex: open source re-implementation with added features
  File format:
      << definitions and %{ initial code %} >>
      %%
      << rules and associated actions >>
      %%
      << extra code >>
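Simple Flex Example
    %{
    #include <stdio.h>
    int num_lines = 0, num_chars = 0;
    %}
    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;
    %%
    int main(void)
    {
        yylex();
        printf("# of lines = %d, # of chars = %d\n", num_lines, num_chars);
        return 0;
    }
  (Shown indented for display; in the .l file, %{, %}, %% and the patterns start in column 1.)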
ATL/0 scanner in lex (definitions)
    %{
    /* scan.l: An ATL/0 scanner. 1/13/94 */
    #include "defs.h"
    #include "global.h"
    #include "parse.h"
    #define YY_NO_UNPUT
    %}
parse.h (created by yacc)
    #define END 257
    #define READ 258
    #define BEGINSY 259
    #define WRITE 260
    #define INTEGER 261
    #define PROGRAM 262
    #define WRITELN 263
    #define VARIABLE 264
    #define ASSIGN 265
    #define CONST 266
    #define ID 267
    typedef union {
        int i_value;
        char *s_value;
        syntax_node *node_ptr;
    } YYSTYPE;
    extern YYSTYPE yylval;
  BEGIN is a flex reserved name, so BEGINSY stands for BEGIN in ATL/1 source.
ATL/0 scanner in lex (rules)
    [ \t]+  { /* ignore spaces and tabs */
              if (list_src) ECHO; }
    \n      { if (list_src) ECHO;
              line_no++;
              dump_errors ();
              if (list_src) fprintf (yyout, "%5d: ", line_no); }
    "+"     |
    "-"     |
    ";"     |
    "("     |
    ")"     |
    ","     |
    "."     |
    ":"     { if (list_src) ECHO;
              return((int)yytext[0]); }
ATL/0 scanner in lex (rules - page 2)
    end       { if (list_src) ECHO; return(END); }
    read      { if (list_src) ECHO; return(READ); }
    begin     { if (list_src) ECHO; return(BEGINSY); }
    write     { if (list_src) ECHO; return(WRITE); }
    integer   { if (list_src) ECHO; return(INTEGER); }
    program   { if (list_src) ECHO; return(PROGRAM); }
    writeln   { if (list_src) ECHO; return(WRITELN); }
    variable  { if (list_src) ECHO; return(VARIABLE); }
ATL/0 scanner in lex (rules - page 3)
    \<--             { if (list_src) ECHO;
                       return(ASSIGN); }
    [a-z][a-z0-9_]*  { if (list_src) ECHO;
                       yylval.s_value = strdup(yytext);
                       return(ID); }
    [0-9]+           { if (list_src) ECHO;
                       yylval.s_value = strdup(yytext);
                       return(CONST); }
ATL/0 scanner in lex (rules - page 4)
    .       { if (list_src) ECHO;
              if (yytext[0] < ' ')
                  yyerror ("illegal character: ^%c", yytext[0] + '@');
              else if (yytext[0] > '~')
                  yyerror ("illegal character: \\%3d", (int) yytext[0]);
              else
                  yyerror ("illegal character: %s", yytext); }
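The three-way classification in this catch-all rule can be tried outside the scanner. The sketch below is a standalone illustration under assumptions: `describe_char` is a hypothetical helper, not part of the ATL sources, and it takes an `unsigned char` on purpose. On platforms where plain `char` is signed, bytes above 127 compare as negative, so a test such as `yytext[0] < ' '` would treat them as control characters.

```c
#include <stdio.h>

/* Formats a printable description of character c into buf.
 * Control characters become ^X, codes above '~' become \ddd (decimal),
 * and everything else is copied as-is. */
void describe_char(unsigned char c, char *buf, size_t size)
{
    if (c < ' ')
        snprintf(buf, size, "^%c", c + '@');     /* e.g. code 1 -> ^A */
    else if (c > '~')
        snprintf(buf, size, "\\%3d", (int)c);    /* DEL and 8-bit codes */
    else
        snprintf(buf, size, "%c", c);            /* ordinary printable */
}
```

For example, code 1 is described as `^A`, `#` as itself, and code 200 as `\200`.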
ATL/0 scanner in lex (subroutines)
    #ifdef TESTSCAN
    YYSTYPE yylval;
    int yyparse()
    {
        int val;
        line_no = 1;
        list_src = 0;
        while ( (val = yylex()) != 0 )
            printf ("val = %d yytext = %s \n", val, yytext);
        return 0;
    }
    #endif
  ( use "make testscan" in atl1 directory )
More about FLEX patterns -- Flex matches the longest sequence of characters that it can
    x         match the character x
    .         any character (byte) except newline
    [xyz]     a "character class"; in this case, the pattern matches either an x, a y, or a z
    [abj-oz]  a "character class" with a range in it; matches an a, a b, any letter from j through o, or a z
    [^A-Z]    a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
    [^A-Z\n]  any character EXCEPT an uppercase letter or a newline
More about FLEX patterns (page 2)
    r*            zero or more r's, where r is any regular expression
    r+            one or more r's
    r?            zero or one r's (that is, "an optional r")
    r{2,5}        anywhere from two to five r's
    r{2,}         two or more r's
    r{4}          exactly 4 r's
    {name}        the expansion of the "name" definition (definitions are explained in a couple of pages)
    "[xyz]\"foo"  the literal string: [xyz]"foo
    \X            if X is an a, b, f, n, r, t, or v, then the ANSI-C interpretation of \X. Otherwise, a literal X (used to escape operators such as *)
More about FLEX patterns (page 3)
    \0      a NUL character (ASCII code 0)
    \123    the character with octal value 123
    \x2a    the character with hexadecimal value 2a
    (r)     match an r; parentheses are used to override precedence
    rs      the regular expression r followed by the regular expression s; called "concatenation"
    r|s     either an r or an s
  Precedence (highest to lowest):
    groups          -- [xyz]
    *, +, ?, r{..}  -- r*
    concatenation   -- rs
    union           -- r|s
  Example: foo|ba[rz]* => (foo)|(ba(([rz])*))
More about FLEX patterns (page 4)
    r/s      an r, but only if it is followed by an s
    ^r       an r, but only at the beginning of a line
    r$       an r, but only at the end of a line (i.e., just before a newline). Equivalent to "r/\n".
    <<EOF>>  matches the end of the file (Flex only)
  Definitions (in definition section):
    DIGIT   [0-9]
  Use (in regular expressions):
    {DIGIT}+("."{DIGIT}+)?
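The number pattern built from DIGIT, combined with the longest-match rule, can be sketched by hand in C. The function below is an assumption for illustration (`match_number` is not a flex API); it returns how many characters a scanner would consume for [0-9]+("."[0-9]+)? at the start of a string, and it takes the dot only when a digit follows.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching [0-9]+("."[0-9]+)?,
 * or 0 if no prefix matches. */
size_t match_number(const char *s)
{
    size_t i = 0;

    while (isdigit((unsigned char)s[i]))
        i++;
    if (i == 0)
        return 0;                   /* need at least one leading digit */

    /* Take the fraction only if a digit follows the dot; otherwise the
     * match ends before the dot, as in 1..10. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;                        /* consume '.' */
        while (isdigit((unsigned char)s[i]))
            i++;
    }
    return i;
}
```

Under this sketch "0.3" is consumed whole, ".3" does not match at all, and "1..10" yields only the leading "1", leaving ".." for another rule.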
Start States
  Method to allow only a few rules to apply at a time
    %x xyz               /* Exclusive start state, declaration part */
  Use in rule part:
    xyz      { BEGIN(xyz); }
    <xyz>r1  { action... }
    <xyz>r2  { action... }
    <xyz>r3  { BEGIN(INITIAL); }   /* Revert to using initial start state */
  INITIAL has the value 0.
Start State Example -- C comments
    %x comment
    %%
    "/*"                    { BEGIN(comment); }
    <comment>[^*\n]         /* eat it! */
    <comment>"*"+[^*/\n]    /* eat it! */
    <comment>\n             { line_no++; }
    <comment>"*"+"/"        { BEGIN(INITIAL); }
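What the start state buys you can be seen in a hand-written equivalent. The following C sketch is an illustration under assumptions (the names `strip_comments`, `INITIAL_MODE` and `COMMENT_MODE` are made up here). It keeps a mode variable that plays the role of the flex start state and switches it where the flex rules call BEGIN. It ignores string literals, which a real C scanner would have to handle.

```c
#include <stdbool.h>

/* Scanner modes, playing the role of flex start states. */
enum mode { INITIAL_MODE, COMMENT_MODE };

/* Copies src to out with C comments removed and counts the newlines
 * seen inside comments in *comment_lines. Returns false if the input
 * ends inside an unterminated comment. out must have room for
 * strlen(src) + 1 bytes. */
bool strip_comments(const char *src, char *out, int *comment_lines)
{
    enum mode m = INITIAL_MODE;

    *comment_lines = 0;
    while (*src != '\0') {
        if (m == INITIAL_MODE) {
            if (src[0] == '/' && src[1] == '*') {
                m = COMMENT_MODE;       /* like BEGIN(comment) */
                src += 2;
            } else {
                *out++ = *src++;        /* ordinary text */
            }
        } else {
            if (src[0] == '*' && src[1] == '/') {
                m = INITIAL_MODE;       /* like BEGIN(INITIAL) */
                src += 2;
            } else {
                if (*src == '\n')
                    (*comment_lines)++;
                src++;                  /* eat it */
            }
        }
    }
    *out = '\0';
    return m == INITIAL_MODE;
}
```

Calling it on `a/* x\ny **/b` leaves `ab` in the output buffer and reports one newline inside the comment.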
Running lex / flex
  File extension is usually .l
    "flex scan.l"           => lex.yy.c
    "flex -oscan.c scan.l"  => scan.c
  With yacc...
    "yacc -d parse.y"       => y.tab.c, y.tab.h
    y.tab.h - token definitions for scanner.
    y.tab.c - C code for parser that calls scanner.
Other considerations...
  Reserved words
    Reserved vs. Restricted
    Part of IDs and then use table lookup?
  Compiler Control, e.g. pragmas
  Conditional Compilation -- C uses #ifdef
  Source Listings -- not as often now
  Symbol Table entry
    Some scanners enter names in a table
  String tables
Other considerations... (page 2)
  Inclusion of other files? (#include "file")
  Multi-character lookahead
    DO 10 I = 1,100 vs. DO 10 I = 1.100 (Fortran)
    arrayname'length vs. 'a' (Ada)
  Non-regular structures
    ATL/1 nested comments
      use variables in scanner!
      flex -- use different start states
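Nested comments are not a regular language, which is why the slide suggests keeping a variable in the scanner. A minimal sketch of that idea in C, assuming ATL/1-style `(*` and `*)` delimiters; the function name `skip_nested_comment` and its return convention are illustrative assumptions.

```c
#include <stddef.h>

/* If s begins with "(*", returns the length of the comment up to and
 * including the MATCHING "*)". Returns 0 if s does not start a comment
 * or the comment is never closed. */
size_t skip_nested_comment(const char *s)
{
    size_t i = 0;
    int depth = 0;              /* the "variable in the scanner" */

    if (s[0] != '(' || s[1] != '*')
        return 0;

    while (s[i] != '\0') {
        if (s[i] == '(' && s[i + 1] == '*') {
            depth++;            /* one level deeper */
            i += 2;
        } else if (s[i] == '*' && s[i + 1] == ')') {
            depth--;            /* close the innermost open comment */
            i += 2;
            if (depth == 0)
                return i;       /* matched the outermost opener */
        } else {
            i++;
        }
    }
    return 0;                   /* ran off the end: unterminated */
}
```

In a flex scanner the same counter could live in the actions of an exclusive start state: increment on `(*`, decrement on `*)`, and return to INITIAL when it reaches zero.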
Lexical Errors
  Delete all characters -- start again?
  Delete first character -- start again?
  How about <- in ATL? (This is about matching... not errors!)
  How about beg#in?
  Flex:
    .   { Generate an error... "eat 1 char" }
ATL/1 Scanner notes: ( ATL1.notes )
  1) Comments start with (* and end at the MATCHING *).
  2) An IDENTIFIER is a string from [A-Za-z][A-Za-z0-9_]*
     Case is important. "Aname" is different from "AName".
  3) RESERVED WORDS look like IDs. All capitalizations of a reserved word are the same reserved word. For example, BeGiN, begin, Begin and so forth are all the same reserved word, BEGIN. The reserved words are:
       DO IF IS OF OR AND END NOT ELSE THEN TYPE ARRAY BEGIN ELSIF UNTIL VALUE WHILE REPEAT RETURN RETURNS PROGRAM VARIABLE FUNCTION PROCEDURE
     (Note: The word BEGIN is reserved in flex. Therefore, yacc and lex use BEGINSY to refer to the BEGIN reserved word in ATL/1.)
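One way to handle reserved words that look like IDs is to match everything with the identifier pattern and then consult a table. The sketch below assumes that approach; the names `reserved`, `same_word` and `reserved_index` are hypothetical, and a real scanner would map the index to the token codes generated by yacc instead of returning it directly.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* The ATL/1 reserved words, stored in upper case. */
static const char *const reserved[] = {
    "DO", "IF", "IS", "OF", "OR", "AND", "END", "NOT", "ELSE", "THEN",
    "TYPE", "ARRAY", "BEGIN", "ELSIF", "UNTIL", "VALUE", "WHILE",
    "REPEAT", "RETURN", "RETURNS", "PROGRAM", "VARIABLE", "FUNCTION",
    "PROCEDURE"
};

/* Case-insensitive comparison of a lexeme against an upper-case word. */
static bool same_word(const char *lexeme, const char *upper)
{
    while (*lexeme != '\0' && *upper != '\0') {
        if (toupper((unsigned char)*lexeme) != *upper)
            return false;
        lexeme++;
        upper++;
    }
    return *lexeme == '\0' && *upper == '\0';   /* same length too */
}

/* Returns the index of lexeme in reserved[], or -1 for an ordinary ID. */
int reserved_index(const char *lexeme)
{
    size_t n = sizeof reserved / sizeof reserved[0];

    for (size_t i = 0; i < n; i++)
        if (same_word(lexeme, reserved[i]))
            return (int)i;
    return -1;
}
```

Here `BeGiN`, `begin` and `Begin` all map to the same index, while `Aname` returns -1 and stays an identifier. A linear scan is enough for 24 words; a sorted table with `bsearch` or a hash would be the usual next step.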
ATL/1 Scanner notes (page 2):
  4) A STRING starts with a double quote (") and ends with a double quote. Strings may not cross line breaks. Strings may have "quoted" characters in the string. They are \b for backspace, \f for formfeed, \n for newline, \r for carriage return, \t for tab, \" for the double quote character and \\ for the backslash character. An arbitrary character can be specified by \nnn notation where nnn is a decimal value less than 256.
     (Your scanner does not have to translate strings; the strings can be copied directly to the assembler. The hcas assembler uses the exact same escape sequences. The primary recognition issue for your scanner is the \".)
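The recognition issue with \" can be sketched as a small hand-written matcher. This is an illustration under assumptions (`match_string` is a made-up name, and the escapes are only skipped, not validated or translated), in keeping with the note that strings can be copied through unchanged.

```c
#include <stddef.h>

/* If s begins with a double quote, returns the length of the string
 * literal including both quotes. Returns 0 if s does not start with a
 * quote, or if a newline or end of input is reached before the closing
 * quote. */
size_t match_string(const char *s)
{
    size_t i;

    if (s[0] != '"')
        return 0;

    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++) {
        if (s[i] == '\\') {
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;       /* a backslash cannot escape a line break */
            i++;                /* skip the escaped character, e.g. \" */
        } else if (s[i] == '"') {
            return i + 1;       /* closing quote found */
        }
    }
    return 0;                   /* unterminated string */
}
```

The key step is skipping the character after each backslash, so that `\"` does not end the string but the quote after `\\` does.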
ATL/1 Scanner notes (page 3):
  5) An INT_CONST is a string of digits ([0-9]+).
  6) A MUL_OP is one of: * / mod
     (mod is like a reserved word even though it is not returned as an ID token. Any capitalization of mod is still mod.)
  7) A REL_OP is one of: = != < <= > >=
  8) An ASSIGN is: <--
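Because `<`, `<=` and `<--` share a prefix, the scanner has to apply the longest-match rule to tell them apart. A minimal sketch in C, assuming a simple table of the symbolic operators listed above (the `ops` table and `longest_op` are illustrative names, and `mod` is left out because it is handled like a reserved word):

```c
#include <stddef.h>
#include <string.h>

/* Symbolic operator lexemes from the notes. Order does not matter
 * because the longest matching prefix wins. */
static const char *const ops[] = {
    "=", "!=", "<", "<=", ">", ">=", "<--", "*", "/"
};

/* Returns the longest operator that is a prefix of s, or NULL if none. */
const char *longest_op(const char *s)
{
    const char *best = NULL;
    size_t best_len = 0;

    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        size_t len = strlen(ops[i]);

        if (len > best_len && strncmp(s, ops[i], len) == 0) {
            best = ops[i];
            best_len = len;
        }
    }
    return best;
}
```

With this table, input starting with `<--` yields ASSIGN's lexeme, `<=1` yields `<=`, and `<-1` yields just `<`, leaving `-` to be scanned next. That is a matching question, not a lexical error.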