Compiler construction in4020, lecture 2
Koen Langendoen
Delft University of Technology, The Netherlands

Overview
  - Generating a lexical analyzer
      - generic methods
      - specific tool: lex
  [diagram: token descriptions as input to lex]

Lex
  - (f)lex: scanner generator for UNIX that turns token descriptions into C code
  - format of the lex input file:

        definitions      regular descriptions
        %%
        rules            regular expressions + actions
        %%
        user code        auxiliary C code

Lex to recognize integers
  - an integer is a non-empty sequence of digits, optionally followed by a
    letter denoting the base class (b for binary and o for octal)

        base    → [bo]
        integer → digit+ base?

  - lex input (rule = expression + action):

        %{
        #include "lex.h"
        %}
        base     [bo]
        digit    [0-9]
        %%
        {digit}+{base}?    {return INTEGER;}
        %%

  - braces {} signal the application of a definition

Lex: resulting C code
  - generated interface:

        char yytext[];      /* token representation */
        int yylex(void);    /* returns type of next token */

  - wrapper function to add token attributes
  - extra rule to count lines:

        \n      {line_number++;}

  - user code:

        void get_next_token(void) {
            Token.class = yylex();
            if (Token.class == 0) {
                /* yylex() returns 0 at end of input */
                Token.class = EOF;
                Token.repr = "<EOF>";
                return;
            }
            Token.pos.line_number = line_number;
            Token.repr = strdup(yytext);
        }

Automatic generation
  - token descriptions → finite-state automaton
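To try the wrapper without running lex, the sketch below pairs it with a hand-written stand-in for yylex(). The stand-in, the set_input() helper, the value of INTEGER and the layout of the Token structure are all assumptions made for illustration; the real lex.h is not shown in the slides, and real lex output reads from a file rather than a string.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() */
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define INTEGER 257               /* assumed token class code */

struct token {                    /* assumed layout of Token */
    int class;
    char *repr;
    struct { int line_number; } pos;
};

struct token Token;
char yytext[256];
int line_number = 1;

static const char *input = "";    /* hypothetical in-memory input */

void set_input(const char *text) { input = text; line_number = 1; }

/* Stand-in for the lex-generated yylex(): recognizes digit+ base?,
 * counts newlines, and silently skips every other character. */
int yylex(void) {
    for (;;) {
        if (*input == '\0') return 0;            /* end of input */
        if (*input == '\n') { line_number++; input++; continue; }
        if (isdigit((unsigned char)*input)) {
            size_t n = 0;
            while (isdigit((unsigned char)input[n])) n++;
            if (input[n] == 'b' || input[n] == 'o') n++;   /* optional base */
            assert(n < sizeof yytext);
            memcpy(yytext, input, n);
            yytext[n] = '\0';
            input += n;
            return INTEGER;
        }
        input++;                                 /* e.g. blanks */
    }
}

/* The wrapper from the slide: adds attributes to the token class. */
void get_next_token(void) {
    Token.class = yylex();
    if (Token.class == 0) {
        Token.class = EOF;
        Token.repr = "<EOF>";
        return;
    }
    Token.pos.line_number = line_number;
    Token.repr = strdup(yytext);  /* caller owns this copy */
}
```

A caller would invoke set_input() once and then call get_next_token() repeatedly until Token.class equals EOF, reading Token.repr and Token.pos.line_number after each call.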
Finite-state automaton (FSA)
  - recognizes input character by character
  - transfers between states
  - an FSA consists of:
      - an initial state
      - a set of accepting states
      - a transition function: State x Char → State
  [diagram: FSA with transitions on 'i' and 'f']

FSA examples

        integral_number    → [0-9]+
        fixed_point_number → [0-9]* '.' [0-9]+

  [diagrams: one FSA per token]

FSA: recognizing both tokens in one pass
  - naïve approach: merge the initial states
  - correct approach: share the common prefix transitions

FSA implementation: transition table
  - concurrent recognition of integers and fixed-point numbers
  - table columns: state | character: digit, dot, other |
    recognized token (integer, fixed point)
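A transition table of this shape can be driven by a short loop. The following is a minimal sketch in C, assuming four states derived from the two token descriptions above; the state names, the enum constants and the function scan_number() are illustrative and not taken from the slides. The loop remembers the last accepting state it passed through, so it reports the longest match.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Character classes: the columns of the transition table. */
enum char_class { CC_DIGIT, CC_DOT, CC_OTHER, CC_COUNT };

/* Token classes reported by the recognizer (illustrative names). */
enum token_class { TK_NONE, TK_INTEGER, TK_FIXED_POINT };

/* States; NO_STATE marks an empty table entry. */
enum { S_START, S_INT, S_DOT, S_FRAC, S_COUNT, NO_STATE = -1 };

static const int next_state[S_COUNT][CC_COUNT] = {
    /*             digit    dot       other    */
    /* S_START */ { S_INT,  S_DOT,    NO_STATE },
    /* S_INT   */ { S_INT,  S_DOT,    NO_STATE },
    /* S_DOT   */ { S_FRAC, NO_STATE, NO_STATE },
    /* S_FRAC  */ { S_FRAC, NO_STATE, NO_STATE },
};

/* The "recognized token" column: which states are accepting. */
static const enum token_class recognized[S_COUNT] = {
    TK_NONE, TK_INTEGER, TK_NONE, TK_FIXED_POINT
};

static enum char_class classify(char ch) {
    if (isdigit((unsigned char)ch)) return CC_DIGIT;
    if (ch == '.') return CC_DOT;
    return CC_OTHER;
}

/* Scan the longest token at the start of `input`. Stores its length in
 * *length and returns its class, or TK_NONE with length 0. */
enum token_class scan_number(const char *input, size_t *length) {
    int state = S_START;
    enum token_class best = TK_NONE;
    size_t best_len = 0;

    for (size_t i = 0; input[i] != '\0'; i++) {
        state = next_state[state][classify(input[i])];
        if (state == NO_STATE) break;            /* no transition: stop */
        assert(state >= 0 && state < S_COUNT);
        if (recognized[state] != TK_NONE) {      /* remember last accept */
            best = recognized[state];
            best_len = i + 1;
        }
    }
    *length = best_len;
    return best;
}
```

With this table, the input "12." yields an integer of length 2: the automaton follows the dot, finds no digit after it, and falls back to the last accepting state.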
FSA exercise (6 min.)
  - draw an FSA to recognize integers

        base    → [bo]
        integer → digit+ base?

  - draw an FSA to recognize the regular expression (a|b)*bab

Answers
  [FSA diagrams]

Automatic generation: FSA
  - start with the initial set of all tokens to be recognized
  - for each character (ch) find the set (S_ch) of tokens that can start
    with ch
  - extend the FSA with the transition (initial set, ch, S_ch)
  - repeat adding transitions (to S_ch) until no new set is generated

Dotted items
  - keep track of the matched characters in a token description:

        T → R        R is a regular expression
        T → α • β    α already matched, β still to be matched

Types of dotted items
  - shift item: dot in front of a basic pattern

        if → i • f
        identifier → • [a-z] [a-z0-9]*

  - reduce item: dot at the end

        if → i f •
        identifier → [a-z] [a-z0-9]* •

  - non-basic item: dot in front of a repeated pattern or parenthesis

        identifier → [a-z] • [a-z0-9]*

Character moves

        item before           input          item after
        T → α • c β           c              T → α c • β
        T → α • [class] β     c ∈ class      T → α [class] • β
        T → α • . β           any c          T → α . • β
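ε moves

        non-basic item            expands to
        T → α • (R)? β            T → α (R)? • β
                                  T → α (• R)? β
        T → α (R •)? β            T → α (R)? • β
        T → α • (R)* β            T → α (R)* • β
                                  T → α (• R)* β
        T → α (R •)* β            T → α (R)* • β
                                  T → α (• R)* β
        T → α • (R)+ β            T → α (• R)+ β
        T → α (R •)+ β            T → α (R)+ • β
                                  T → α (• R)+ β
        T → α • (R1|R2) β         T → α (• R1|R2) β
                                  T → α (R1|• R2) β
        T → α (R1 •|R2) β         T → α (R1|R2) • β

FSA construction
  - a state corresponds to a set of basic items
  - a character move yields a new set
  - expand non-basic items into basic items using ε moves
  - see if the resulting set was produced before; if not, introduce a new state
  - add the transition

Example
  - token descriptions (D = digit):

        integer:     I → (D)+
        fixed-point: F → (D)* '.' (D)+

  - initial state:

        I → • (D)+
        F → • (D)* '.' (D)+

    which expands under ε moves to the basic items

        I → (• D)+
        F → (• D)* '.' (D)+
        F → (D)* • '.' (D)+

Example: character moves
  - from the initial state, on D:

        I → (D)+ •                [integer recognized]
        I → (• D)+
        F → (• D)* '.' (D)+
        F → (D)* • '.' (D)+

    a further D leads back to this same state
  - from either state above, on '.':

        F → (D)* '.' (• D)+

  - from that state, on D:

        F → (D)* '.' (D)+ •       [fixed point recognized]
        F → (D)* '.' (• D)+

    a further D leads back to this same state

Exercise (7 min.)
  - draw the FSA (with item sets) for recognizing an identifier:

        identifier → letter (letter_or_digit_or_und* letter_or_digit+)?

  - extend the above FSA to recognize the keyword 'if' as well:

        if → i f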
Answers
  [FSA diagram with item sets]

Transition table compression
  - redundant rows
  - empty transitions
  - table columns: state | character: i, f, L, D, U |
    recognized token (identifier, keyword if)
  - row displacement

Summary: generating a lexical analyzer
  - tool: lex
      - token descriptions + actions
      - wrapper interface
  - FSA
      - dotted items
      - character moves
      - ε moves

Homework
  - study sections 2.1.10 - 2.1.12
      - lexical identification of tokens
      - symbol tables
      - macro processing
  - print handout lecture 3 [blackboard]
  - find a partner for the "practicum"
  - register your group
      - send e-mail to koen@pds.twi.tudelft.nl
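Row displacement, listed above under transition table compression, overlaps the rows of a sparse transition table in one array so that empty transitions of one row are filled by entries of another. The following is a minimal sketch in C using first-fit placement; the table size, the struct packed_table layout and the function names pack() and lookup() are assumptions for illustration, not the scheme of any particular generator.

```c
#include <assert.h>

enum {
    N_STATES = 4,
    N_CLASSES = 3,
    EMPTY = -1,                          /* marks an empty transition */
    MAX_PACKED = N_STATES * N_CLASSES    /* worst case: no overlap at all */
};

struct packed_table {
    int displacement[N_STATES];  /* where each row starts in entry[] */
    int entry[MAX_PACKED];       /* next state stored in this slot */
    int owner[MAX_PACKED];       /* state whose row owns the slot, or EMPTY */
    int used;                    /* slots [0, used) may be occupied */
};

/* Can `row` be placed at displacement d without hitting a used slot? */
static int fits(const struct packed_table *p, const int row[N_CLASSES], int d) {
    for (int c = 0; c < N_CLASSES; c++)
        if (row[c] != EMPTY && p->owner[d + c] != EMPTY)
            return 0;
    return 1;
}

/* Pack all rows of `table` into p, first fit, one row at a time. */
void pack(const int table[N_STATES][N_CLASSES], struct packed_table *p) {
    for (int i = 0; i < MAX_PACKED; i++) {
        p->entry[i] = EMPTY;
        p->owner[i] = EMPTY;
    }
    p->used = 0;
    for (int s = 0; s < N_STATES; s++) {
        int d = 0;
        /* d == p->used always fits: every slot from there on is free */
        while (d < p->used && !fits(p, table[s], d))
            d++;
        assert(d + N_CLASSES <= MAX_PACKED);
        p->displacement[s] = d;
        for (int c = 0; c < N_CLASSES; c++) {
            if (table[s][c] != EMPTY) {
                p->entry[d + c] = table[s][c];
                p->owner[d + c] = s;
            }
        }
        if (d + N_CLASSES > p->used)
            p->used = d + N_CLASSES;
    }
}

/* Look up a transition; the owner check rejects slots of other rows. */
int lookup(const struct packed_table *p, int state, int char_class) {
    int i = p->displacement[state] + char_class;
    return p->owner[i] == state ? p->entry[i] : EMPTY;
}
```

The owner array is what keeps the lookup correct: a slot reached through one row's displacement may hold another row's entry, and without the check that entry would be mistaken for a valid transition.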