Week 3 Lexical Analysis Tasks of Lexical Analysis Why separating lexical analysis and parsing? Tokens, Patterns and Lexemes Complex tokens like identifier and numeral are described using regularexpression notations. Attributes for Tokens: provide actual value for a lexeme (a pointer to an entry in symbol table). A symbol table may contain word symbols, names, and numerals. Buffering Techniques: Buffer Pairs and Sentinels Buffer Paris: a buffer contains 2 N-character halves. Each system read system call brings N characters into buffer. Two pointers are maintained: forward and lexeme_ning. Drawback: can not recognize token with length more than N. i = i + 1 ; e n d EOF if forward at of first half then reload second half; forward := forward + 1 else if forward at of second half then reload first half; move forward to ning of first half else forward := forward +1 Sentinels: reduce two comparisons to on e. i = i + 1 ; EOF e n d EOF forward := forward +1 if *forward = EOF then if forward at of first half then reload second half; forward := forward + 1 else if forward at of second half then reload first half; move forward to ning of first half else terminate lexical analysis 1
Token Specification Strings and Languages A language denotes any set of strings over some fixed alphabet. The empty set or the set that contains empty string are languages. Operations on Languages: union, concatenation, and closure. Regular Expressions Regular expressions over alphabet sigma is defined as follows: 1) epsilon is a regular expression that denotes {epsilon 2) if a is a symbol in alphabet, then a is a regular expression that denotes {a. 3) suppose r and s are regular expressions denoting the languages L(r) and L(s). then a. (r) (s) is a regular expression denoting L(r) U L(s). b. (r) (s) is a regular expression denoting L(r)L(s). c. (r)* is a regular expression denoting (L(r))*. d. (r) is a regular expression denoting L(r). A language denoted by a regular expression is said to be a regular set. Regular Definitions A regular definition over sigma is a sequence of definitions of the form d1 -> r1 d2 -> r2 dn -> rn where each di is a distinct name and each ri is a reqular expression over the symbols in sigma U {d1, d2,, di-1. Non-regular sets Some language can not be described by any regular expression. 1) Balanced or nested construct: e.g., the set of all strings of balanced parentheses can not be described by a regular expression. However, it can be specified by a context-free grammar. 2) Repeating strings cannot be described by regular expressions. {wcw w is a string of a s and b s can not be described by any regular expression, nor can it be specified by a context-free grammar. Recognition of Tokens Consider the Pascal- grammar. The terminals program, ;, etc., generate sets of strings given by the following regular definitions. (add Line 41 and 42 to recognize names and numbers where digit = {0, 1, 2,, 9 and letter = {A, B,, Z, a, b,, z) Numeral -> digit+ 2
Name -> letter (letter digit)* program -> program ; -> ; const -> const TypeName -> Name VariableName -> Name ProcedureName -> Name FieldName -> Name ConstantName -> Name We also assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs and newlines. The lexical analyzer will do so by comparing a string against the regular definition ws: delim -> blank tab newline ws -> delim+ Our goal is to isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute-value using the translation table: Regular Expression Token Attribute Value Ws - - Numeral Numeral pointer to table entry Name Name pointer to table entry program program PROGRAM ; Semicolon SEMICOLON const const CONST TypeName Name Pointer to table entry VariableName Name Pointer to table entry ProcedureName Name Pointer to table entry FieldName Name Pointer to table entry ConstantName Name Pointer to table entry The PROGRAM, SEMICOLON, CONST are predefined constants; the pointer to table entry may be the index of the table. Transition Diagrams: Start state, accepting state (double cirle), and asterisk (retract forward pointer). Each edge is associated with a character. If failure occurs in all transition diagrams, a lexical error has been detected and an errorrecovery routine is invoked. For performance consideration, looking for frequently occurring tokens before less frequently occurring ones, e.g., white space. 3
other 4 start [ 0 { ] 1 2 return(bar,bar) delimiter other 3 * return(leftsquare, LSQUARE) A technique for separating word symbols and names is to initialize a symbol table with word symbols. Implementing a Transition Diagram: Procedure Begin c := If c = [] then c := switch (c) { case ] : return(bar, BAR); case c in delimiter: while c in delimiter c:= case { : while c!= c := default: retract(); return(leftsquare, LSQUARE); ; if; Procedure nexttoken() skipseparator(); switch (ch) { 4
; case letter: scanword(); case digit: scannumeral(); case + : scanplus(); case EOF: return(eof); procedure skipseparator() while ch in delimiter and { do switch (ch) { case space: case newline: lineno++; case tab: case { : comment(); ; procedure comment() while ch!= do if ch = { then comment(); else if ch = newline then lineno++; if ch = then else error( comment ); ; Searching Linear search Letter indexing Hashing Direct chaining Linear Probing 5