A Pascal program

    program xyz(input, output);      Program header
    var A, B, C: integer;            Declaration part
    begin A := B + C * 2 end.        Statement part

Input from the file is read into a program buffer:

    program buffer:  program xyz(input, output) --- begin A := B + C * 2 end. \0

OR the input is read char by char into a lexeme buffer.
BUT then the next char after the lexeme has already been read. This character must be saved (it may be the beginning of the next lexeme):

    lexeme buffer:  xyz\0      saved char:  (

This technique is used for Prolog & Lisp.

DFR - PL - Program
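
The char-by-char technique with a saved character can be sketched in C as follows. This is only an illustration under stated assumptions: `CharSource`, `next_char` and `read_alnum_lexeme` are hypothetical names, a string stands in for the input file, and only alphanumeric lexemes are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical character source: hands out one character per call. */
typedef struct {
    const char *text;   /* stands in for the input file */
    size_t pos;         /* index of the next unread character */
    int saved;          /* character read past the last lexeme, or -1 */
} CharSource;

/* Return the saved character if there is one, otherwise read a new one.
   '\0' signals the end of the input. */
int next_char(CharSource *src)
{
    if (src->saved != -1) {
        int c = src->saved;
        src->saved = -1;
        return c;
    }
    if (src->text[src->pos] == '\0')
        return '\0';
    return (unsigned char)src->text[src->pos++];
}

/* Read one run of letters and digits into lexbuf (NUL-terminated).
   Returns its length, or 0 if the next lexeme is not alphanumeric. */
size_t read_alnum_lexeme(CharSource *src, char *lexbuf, size_t size)
{
    size_t n = 0;
    int c;

    assert(size > 0);
    do {                              /* skip white space */
        c = next_char(src);
    } while (c != '\0' && isspace(c));

    while (isalnum(c) && n + 1 < size) {
        lexbuf[n++] = (char)c;
        c = next_char(src);
    }
    lexbuf[n] = '\0';
    src->saved = c;                   /* one character too many was read: save it */
    return n;
}
```

Starting from `CharSource src = { "program xyz(input", 0, -1 };`, the second call leaves `xyz` in the lexeme buffer and `(` in `src.saved`, as on the slide.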

Program: textual view

A program is a sequence of characters from an alphabet.

    White space:           blank, tab, newline (ignored)
    Alphanumeric strings:  begin with a letter (keywords / user defined ids)
    Numeric strings:       begin with a number
    Other strings:         begin with a non letter/number

Patterns

Literal strings:

    program, input, output, var, integer, begin, end
    (   ,   )   ;   :   :=   +   *   .

Regular expressions:

    Alphanumeric:  [a-zA-Z][a-zA-Z0-9]*
    Numeric:       [0-9][0-9]*

Matching is done via algorithms OR table lookup.
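
Matching "via algorithms" could look like the following C sketch, which hand-codes the two regular expressions. The function names are hypothetical, and `isalpha`/`isalnum`/`isdigit` are assumed to run in the default "C" locale, where they cover exactly the ASCII letters and digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching [a-zA-Z][a-zA-Z0-9]*,
   or 0 if s does not begin with a letter. */
size_t match_alphanumeric(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of the longest prefix of s matching [0-9][0-9]*,
   or 0 if s does not begin with a digit. */
size_t match_numeric(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}
```

Both functions return the length of the match, so a caller can copy exactly that many characters as the lexeme.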

Lexeme

Each substring of the program text (input string) which is matched by a pattern is called a LEXEME, e.g. a keyword, user defined id, symbol or number:

    program xyz ( input , output ) ; var A , B , C : integer ; begin A := B + C * 2 end .

Copy the lexeme from the input buffer to a lexeme buffer
OR read the file char by char into the lexeme buffer.
BUT then the next char after the lexeme has already been read. This character must be saved (it may be the beginning of the next lexeme).
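
The first variant, copying each lexeme from the program buffer to a lexeme buffer, might be sketched in C like this. `next_lexeme` is a hypothetical name; the sketch assumes the whole program is already in a NUL-terminated buffer, and as a simplification a lexeme longer than the lexeme buffer is split.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy the next lexeme from the program buffer into lexbuf and advance
   *cursor past it. Returns the lexeme length, 0 at the end of the buffer. */
size_t next_lexeme(const char **cursor, char *lexbuf, size_t size)
{
    const char *p = *cursor;
    size_t n = 0;

    assert(size >= 3);                        /* room for ":=" plus NUL */
    while (isspace((unsigned char)*p))        /* white space is ignored */
        p++;

    if (isalpha((unsigned char)*p)) {         /* keyword or user defined id */
        while (isalnum((unsigned char)*p) && n + 1 < size)
            lexbuf[n++] = *p++;
    } else if (isdigit((unsigned char)*p)) {  /* number */
        while (isdigit((unsigned char)*p) && n + 1 < size)
            lexbuf[n++] = *p++;
    } else if (*p != '\0') {                  /* other: a symbol */
        lexbuf[n++] = *p++;
        if (lexbuf[0] == ':' && *p == '=')    /* the two-character symbol := */
            lexbuf[n++] = *p++;
    }
    lexbuf[n] = '\0';
    *cursor = p;
    return n;
}
```

Because the cursor stays in the program buffer, no character has to be saved here: the character after the lexeme is simply still unread.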

Tokens

Each LEXEME may be represented by a TOKEN, often a (symbolic) integer value to save space.
A TOKEN represents a class of lexemes (often with just 1 member).
ID and NUMBER have a potentially infinite number of members.

    program xyz ( input , output ) ; var A , B , C : integer ; begin A := B + C * 2 end .

Lexemes to tokens:

    lexeme:  program  xyz  (   input  ,   output  )   ;
    token:   261      257  40  262    44  263     41  59

OR use symbolic names:

    program  ID  lparen  input  comma  output

Token Values

In a language such as C the values may be defined as

    ASCII values (0-255) for single character tokens
    values > 256 for ID, NUMBER, assign and the keywords

    typedef enum tvalues {      /* tokens + keywords */
        tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
        kstart, program, input, output, var, begin, end, boolean, integer, real, kend
    } toktyp;
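
The following C sketch shows how the two value ranges can be told apart. It assumes that `tstart`/`tend` and `kstart`/`kend` are meant as range markers around the plain tokens and the keywords; that reading, and the three helper functions, are assumptions rather than part of the slides. With this enum `id` is 258 and `program` is 268, so the numeric examples on the other slides should be read as illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum tvalues {      /* tokens + keywords, as on the slide */
    tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
    kstart, program, input, output, var, begin, end, boolean, integer, real, kend
} toktyp;

/* Single character tokens are their own ASCII value (0-255). */
bool is_char_token(int token)
{
    return token >= 0 && token <= 255;
}

/* Tokens strictly between the markers tstart and tend. */
bool is_plain_token(int token)
{
    assert(tstart < tend);
    return token > tstart && token < tend;
}

/* Keywords strictly between the markers kstart and kend. */
bool is_keyword(int token)
{
    assert(kstart < kend);
    return token > kstart && token < kend;
}
```

For example, `is_char_token('(')` holds because `'('` is 40 in ASCII, while `is_keyword(program)` holds and `is_keyword(id)` does not.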

The Parsing Process

The role of the Parser is to determine whether the input program is syntactically correct or not.
The role of the Lexer is to identify lexemes and convert them to tokens (as well as to remove white space).

    Program text (input string)
        -> Lexical Analysis (pattern matching)
        -> Token stream
        -> Parsing (syntax checks)
        -> T/F
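
The whole pipeline can be sketched in C for a single assignment statement. Everything here is a hypothetical illustration: the token names `T_ID`, `T_NUMBER`, `T_ASSIGN`, `T_EOF`, the function names, and the small grammar `assignment = id ":=" operand { ("+" | "*") operand }` are assumptions, and keywords are not distinguished from ids.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token values; single character tokens use their ASCII value. */
enum { T_ID = 258, T_NUMBER, T_ASSIGN, T_EOF };

/* Lexer: return the token of the next lexeme and advance *p past it. */
int next_token(const char **p)
{
    const char *s = *p;
    int token;

    while (isspace((unsigned char)*s))           /* remove white space */
        s++;
    if (*s == '\0') {
        token = T_EOF;
    } else if (isalpha((unsigned char)*s)) {
        while (isalnum((unsigned char)*s))
            s++;
        token = T_ID;
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
        token = T_NUMBER;
    } else if (s[0] == ':' && s[1] == '=') {
        s += 2;
        token = T_ASSIGN;
    } else {
        token = (unsigned char)*s++;             /* single character token */
    }
    *p = s;
    return token;
}

/* operand = id | number */
static bool is_operand(int token)
{
    return token == T_ID || token == T_NUMBER;
}

/* Parser: true if text is a syntactically correct assignment. */
bool parse_assignment(const char *text)
{
    int token;

    assert(text != NULL);
    if (next_token(&text) != T_ID)
        return false;
    if (next_token(&text) != T_ASSIGN)
        return false;
    if (!is_operand(next_token(&text)))
        return false;
    while ((token = next_token(&text)) == '+' || token == '*')
        if (!is_operand(next_token(&text)))
            return false;
    return token == T_EOF;                       /* nothing may follow */
}
```

The parser never looks at characters, only at tokens, and its result is the T/F answer at the end of the diagram.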

Lexemes -> Tokens (via a table)

    Keyword table            Token table
    lexeme     token         lexeme     token
    program    program       id         id
    input      input         number     number
    output     output        :=         assign
    var        var           ,          comma
    integer    integer       ;          semicolon
    begin      begin         +          plus
    etc.       etc.          etc.       etc.

NB: id and number are pseudo-lexemes.
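
A table-driven conversion from lexeme to token might look like this in C. It is a sketch under assumptions: it reuses the enum from the Token Values slide, returns the ASCII value for single character symbols instead of symbolic names such as comma, compares keywords case-sensitively, and the names `tab_entry`, `keyword_table` and `lexeme_to_token` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum tvalues {      /* tokens + keywords */
    tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
    kstart, program, input, output, var, begin, end, boolean, integer, real, kend
} toktyp;

typedef struct {
    const char *lexeme;
    int token;
} tab_entry;

/* Keyword table: one lexeme per token. */
const tab_entry keyword_table[] = {
    {"program", program}, {"input", input},     {"output", output},
    {"var", var},         {"begin", begin},     {"end", end},
    {"boolean", boolean}, {"integer", integer}, {"real", real},
};

/* Convert a lexeme to its token value. */
int lexeme_to_token(const char *lexeme)
{
    size_t i, n = sizeof keyword_table / sizeof keyword_table[0];

    assert(lexeme != NULL);
    for (i = 0; i < n; i++)                      /* keyword lookup */
        if (strcmp(lexeme, keyword_table[i].lexeme) == 0)
            return keyword_table[i].token;
    if (strcmp(lexeme, ":=") == 0)
        return assign;
    if (isalpha((unsigned char)lexeme[0]))       /* any other alphanumeric */
        return id;
    if (isdigit((unsigned char)lexeme[0]))       /* any numeric string */
        return number;
    if (lexeme[0] != '\0' && lexeme[1] == '\0')  /* single character symbol */
        return (unsigned char)lexeme[0];
    return error;
}
```

The keyword table has to be searched before the id test, since every keyword also matches the alphanumeric pattern.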

Tokens -> Lexemes (via a table)

Often for debugging, it is useful to convert the tokens back to lexemes: use the tables (ID -> id ; NUMBER -> number).

1. All identifiers map to the pseudo-lexeme ID
2. For ID we have a {token, lexeme} tuple {ID, xyz}
3. Similarly all numbers map to NUMBER
4. Again we have a {token, lexeme} tuple {NUMBER, 2}
5. We will use this in the Prolog and Lisp parsers (as a list)
6. In the C parser we have get_token() and get_lexeme()
7. This means that the actual values (lexemes) of IDs and NUMBERs must be saved
8. IDs -> Symbol Table
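
The reverse lookup and the {token, lexeme} tuple could be sketched in C as below. The names `token_pair`, `reverse_table`, `token_to_lexeme` and `display_lexeme` are hypothetical and are not the get_token()/get_lexeme() interface of the C parser; the enum is the one from the Token Values slide.

```c
#include <assert.h>
#include <stddef.h>

typedef enum tvalues {      /* tokens + keywords */
    tstart = 257, id, number, assign, predef, tempty, undef, error, typ, tend,
    kstart, program, input, output, var, begin, end, boolean, integer, real, kend
} toktyp;

/* A {token, lexeme} tuple such as {id, "xyz"} or {number, "2"}. */
typedef struct {
    int token;
    const char *lexeme;
} token_pair;

/* Reverse table: id and number map to their pseudo-lexemes. */
const token_pair reverse_table[] = {
    {id, "id"},           {number, "number"},   {assign, ":="},
    {program, "program"}, {input, "input"},     {output, "output"},
    {var, "var"},         {begin, "begin"},     {end, "end"},
    {boolean, "boolean"}, {integer, "integer"}, {real, "real"},
};

/* Convert a token back to a lexeme for debugging. The result for a single
   character token lives in a static buffer that the next call overwrites. */
const char *token_to_lexeme(int token)
{
    static char single[2];
    size_t i, n = sizeof reverse_table / sizeof reverse_table[0];

    for (i = 0; i < n; i++)
        if (reverse_table[i].token == token)
            return reverse_table[i].lexeme;
    if (token > 0 && token <= 255) {
        single[0] = (char)token;
        single[1] = '\0';
        return single;
    }
    return "?";                                  /* unknown token */
}

/* For id and number the saved lexeme is the actual value. */
const char *display_lexeme(token_pair p)
{
    assert(p.lexeme != NULL);
    if (p.token == id || p.token == number)
        return p.lexeme;
    return token_to_lexeme(p.token);
}
```

The table alone can only recover `id` for an identifier; getting `xyz` back is possible only because the tuple carries the saved lexeme.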

Symbol Table

    Name      Rôle        Type      Size   Address (or offset)
    _predef   type        _predef   0      0
    _undef    type        _predef   0      0
    _error    type        _predef   0      0
    integer   type        _predef   4      0
    boolean   type        _predef   4      0
    xyz       program id  _predef   12     9999
    A         variable    integer   4      0
    B         variable    integer   4      4
    C         variable    integer   4      8
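
The variable rows of the table can be reproduced with a small C sketch. The struct layout, the fixed limits and the function names are hypothetical. The sketch assumes that each variable is placed at the next free offset, which matches the addresses 0, 4 and 8 shown for A, B and C; that the size 12 of xyz is the sum of its variables' sizes is an inference from the table, not something the slide states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 32
#define NAME_LEN    16

typedef struct {
    char name[NAME_LEN];
    char role[NAME_LEN];
    char type[NAME_LEN];
    int  size;
    int  address;           /* offset from the start of the variable area */
} symbol;

typedef struct {
    symbol rows[MAX_SYMBOLS];
    int    count;
    int    next_offset;     /* first free offset */
} symtab;

/* Copy at most NAME_LEN - 1 characters and always NUL-terminate. */
static void copy_field(char *dst, const char *src)
{
    strncpy(dst, src, NAME_LEN - 1);
    dst[NAME_LEN - 1] = '\0';
}

/* Return the row index of name, or -1 if it is not in the table. */
int find_symbol(const symtab *t, const char *name)
{
    int i;

    assert(t != NULL && name != NULL);
    for (i = 0; i < t->count; i++)
        if (strncmp(t->rows[i].name, name, NAME_LEN - 1) == 0)
            return i;
    return -1;
}

/* Add a variable at the next free offset. Returns its address,
   or -1 if the table is full or the name is already declared. */
int add_variable(symtab *t, const char *name, const char *type, int size)
{
    symbol *s;

    if (find_symbol(t, name) >= 0 || t->count == MAX_SYMBOLS)
        return -1;
    s = &t->rows[t->count++];
    copy_field(s->name, name);
    copy_field(s->role, "variable");
    copy_field(s->type, type);
    s->size = size;
    s->address = t->next_offset;
    t->next_offset += size;
    return s->address;
}
```

Adding A, B and C as 4-byte integers to an empty table yields the addresses 0, 4 and 8 and leaves `next_offset` at 12.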

What's the difference?

This may lead to some confusion.

                keyword / ID
    Program     program xyz              (text string)
    Pattern     program                  (text string)
    OR Pattern  [a-zA-Z][a-zA-Z0-9]*     (regular expression)
    Lexeme      program / xyz            (substring of the program)
    Token       program / id             (symbolic name)
    OR Token    257 / 258                (integer value)

An alphanumeric lexeme is either a keyword or an ID.

Summary

    Source program:  a text, i.e. a string
    Pattern:         string or regular expression
    Lexeme:          substring of the program text
    Token:           class of lexemes (often with just 1 member)
    Token representation: as integers or symbolic names
    Lexer:           string -> token stream
    Parser:          token stream -> Boolean (T/F)