Topics in IT 1: Parsing and Pattern Recognition
Week 10: Lexical Analysis
College of Information Science and Engineering, Ritsumeikan University
this week

- mid-term evaluation review
- lexical analysis
  - its place in typical compiler architecture
  - semantic type vs. semantic value
  - tokenisation
- tools and examples
compiler architecture: lexical analysis

    source file (text)       for (i = 0; i < 10; ++i) printf("%i\n", i);
          |
    tokeniser (lexical analyser)
          |
    tokens                   FOR LPAREN ID<i> EQ NUM<0> SEMI ...
          |
    parser (syntax analyser)
          |
    tree                     (FOR (= i 0) ...)
          |
    optimiser
          |
    code generator
          |
    assembly language        movl $0, 24(%esp)
          |
    executable file (binary)
tokenisation

- also known as lexical analysis or scanning
- input is text from a file or from the terminal

    input text --> buffer (character sequence)
               --> string matching rules (regular expressions)
               --> rule actions (C code fragments) allocate & initialise tokens
               --> tokens, supplied to the parser
lexical analysis

- a grammar defines the structure of sentences of a language
- categories (ID, NUM, ...) represent roles, ignoring specific values
  - e.g., foo, bar and baz are all ID, regardless of their names

    %token FOR LPAREN RPAREN SEMI EQ ID NUM
    statement  = FOR LPAREN statement SEMI expression SEMI expression RPAREN statement | ...
    expression = ID EQ expression | NUM | ...

- this is sufficient to recognise whether a sentence is grammatical
- however, a parser does care about specific values!
  - ID foo is not the same as ID bar, ...
semantic types and values

- a token combines a category and (when appropriate) a specific value

    category    value
    ID          char *name
    NUM         int value
    BINARYOP    <=, +, =, etc.

- with values, we can analyse the semantics (meaning) of a program
- the two parts of a token are therefore called
  - semantic type  (identifier, number, binary operator, etc.)
  - semantic value (foo, 123, ADD, etc.)
tokens

- semantic types can be represented by unique values (e.g., integers)

    enum {                  // semantic types
        ID, INT, FLOAT,     // variables and literals
        UNYOP, BINOP,       // unary and binary operators
        LPAREN, RPAREN,     // punctuation ...
    };

- the type of the semantic value often depends on the semantic type

    enum { ADD, SUB, MUL, DIV, MOD, ... };  // operators

    struct token {
        int semantic_type;
        union {                     // semantic values
            char  *id_name;         // ID
            long   integer_value;   // INT
            double float_value;     // FLOAT
            int    operator;        // BINOP, UNYOP, etc.
        } semantic_value;
    };
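as a minimal sketch of how such tokens might be constructed and inspected (the helper functions make_int and make_binop are assumptions for illustration, not part of the slide's interface):

```c
#include <assert.h>

enum { ID, INT, FLOAT, UNYOP, BINOP, LPAREN, RPAREN };  /* semantic types */
enum { ADD, SUB, MUL, DIV, MOD };                       /* operator values */

struct token {
    int semantic_type;
    union {                     /* semantic values */
        char  *id_name;         /* ID */
        long   integer_value;   /* INT */
        double float_value;     /* FLOAT */
        int    op;              /* BINOP, UNYOP, etc. */
    } semantic_value;
};

/* hypothetical constructors: one per kind of token */
struct token make_int(long v)
{
    struct token t = { INT };
    t.semantic_value.integer_value = v;
    return t;
}

struct token make_binop(int op)
{
    struct token t = { BINOP };
    t.semantic_value.op = op;
    return t;
}
```

the parser first switches on semantic_type, and only then reads the matching union member; reading any other member would be meaningless.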
problem: tokenisation

- identify lexemes in the source code of a program
  - punctuation, keywords, identifiers, numbers, etc.

solution:

- use regular expressions to describe what each lexeme looks like
- convert the regular expressions into a DFA
- accept whenever an entire lexeme has been read
- construct a token and return it to the parser

extended regular expressions:

    [abcde]    a | b | c | d | e    character set
    [a-dp-s]   [abcdpqrs]           character range
    .          any character        wildcard
tokenisation

    if else do while for break continue return    language keywords
    ; ( )                                         language punctuation
    [-+]?[0-9]+                                   signed decimal integer
    [A-Za-z_][A-Za-z_0-9]*                        identifier
    [ \t\n\r]                                     blank (white space)
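before looking at generated scanners, the patterns above can be matched by hand; a sketch, using the "longest match" convention (each function returns the length of the longest matching prefix, or 0 if there is no match):

```c
#include <ctype.h>
#include <stddef.h>

/* longest prefix of s matching [A-Za-z_][A-Za-z_0-9]*, or 0 */
size_t match_identifier(const char *s)
{
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_')
        for (n = 1; isalnum((unsigned char)s[n]) || s[n] == '_'; ++n)
            ;                           /* consume letters, digits, '_' */
    return n;
}

/* longest prefix of s matching [-+]?[0-9]+, or 0 */
size_t match_integer(const char *s)
{
    size_t n = (s[0] == '-' || s[0] == '+') ? 1 : 0;
    size_t start = n;                   /* sign alone is not a match */
    while (isdigit((unsigned char)s[n]))
        ++n;
    return (n > start) ? n : 0;
}
```

a real scanner runs all patterns in parallel (as one DFA) and keeps the longest match; calling separate functions like this merely illustrates what each pattern accepts.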
lex scanner generator

lex automates:

- buffering and sequencing of input text
- creating a FSA from regular expressions
- scanning the input characters using the FSA
- recognising semantic types and values
- executing user-supplied actions to create tokens
- supplying tokens one at a time to a client (e.g., a parser)

    scanner.l (definitions, regular expressions, actions)
        --lex--> lex.yy.c --cc--> a.out
    text --a.out--> tokens
lex scanner specification

three sections:

- C declarations and named REs
  - named REs can be referred to as {name}
- RE rules and associated actions
  - actions can be enclosed in { ... } braces
- supporting C functions
  - can be called from within actions

lex converts the specification into a C program, lex.yy.c:

- lex.yy.c is compiled (with the parser, etc.) to create the compiler front-end
- the default action of lex.yy.c is to echo characters as they are read
  - lex can be used to make simple text filters, word counters, etc.
lex scanner specification

    %{ /* declarations */
    enum { FOR, ID, INTEGER, FLOAT, EQ, LPAREN, RPAREN, SEMI };
    Symbol *intern(const char *string);
    %}
    spaces   [ \t\n]+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    integer  {digit}+
    float    {digit}+\.{digit}+(e[+-]?{digit}+)?
    %% /* rules and actions */
    {spaces}  { /* ignored */ }
    for       { return FOR; }
    {id}      { yylval.symbol = intern(yytext); return ID; }
    {integer} { yylval.integer_val = atoi(yytext); return INTEGER; }
    {float}   { yylval.float_val = atof(yytext); return FLOAT; }
    "="       { return EQ; }
    %% /* supporting functions */
    Symbol *intern(const char *string) { ... }
examples (available for download from the course website)

    echo.l  unspace.l  startstop.l  wordnum.l  wc.l  config.l  config2.l
    config.txt (example input for config and config2)

these demonstrate:

- the default behaviour: characters are echoed as they are read
- matched characters are not echoed
- actions are attached to matching patterns
- yytext contains the matched text
- EOF can be matched too
- configuration files, etc., can easily be scanned

to compile on Mac, Linux, or Cygwin (Windows):

    lex filename.l ; cc -o filename lex.yy.c
symbols and symbol tables

- identifiers are often treated specially
  - the same names reappear very many times
  - wasteful to allocate a new string for each occurrence
  - inefficient to compare identifiers using string comparison
  - associated information (type, defined value of symbolic constants, etc.) must be stored somewhere
- identifiers are converted into symbols
  - a symbol is a unique string (maybe with other information)
  - stored in a symbol table (binary tree, hash table, ...)
  - identifier names are looked up in the table during scanning
    - if found, the existing symbol is reused
    - otherwise a new symbol is created
  - symbols are compared by identity (not equality of contents)
  - the symbol provides a place to store additional information about the identifier
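the lookup-or-create operation is usually called interning; a minimal sketch using a linked list (a real scanner would use the binary tree or hash table mentioned above, but the interface is identical):

```c
#include <stdlib.h>
#include <string.h>

typedef struct Symbol {
    char *name;
    struct Symbol *next;
    /* additional information (type, defined value, ...) would go here */
} Symbol;

static Symbol *symbols = NULL;      /* head of the symbol table */

/* return the unique Symbol for the given name, creating it if needed */
Symbol *intern(const char *string)
{
    Symbol *s;
    for (s = symbols; s != NULL; s = s->next)
        if (strcmp(s->name, string) == 0)
            return s;                           /* found: reuse it */
    s = malloc(sizeof(Symbol));                 /* not found: create it */
    s->name = malloc(strlen(string) + 1);
    strcpy(s->name, string);
    s->next = symbols;
    return symbols = s;
}
```

because intern() always returns the same pointer for equal names, identifiers can subsequently be compared with == instead of strcmp(), which is exactly the identity comparison described above.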
examples

    tokenise.l  tokenise2.l

- tokens are made from a type + a value
- yylex() and yylval provide the tokens
- an ordered tree of symbols is created
- previously-created symbols are always reused
- symbols can be compared by identity
lex implementation

- a deterministic FSA (DFA) is used
  - very fast: a table lookup performs each transition immediately
    - current state + next character -> next state
  - an NFA is constructed from the regular expression rules
  - the DFA is constructed from the NFA
- no need for separate finite-choice matching of keywords
  - the DFA is faster than a series of strcmp()s
- DFA tables rapidly grow quite large
  - trivial languages have hundreds of states
  - 128 (ASCII) or 256 (8-bit input, e.g., UTF-8 bytes) characters per state
  - a table compression algorithm can be used to minimise size
lex complications

- ambiguity between rules
  - the longest matching rule is always preferred
  - if two rules match the same input characters, the one occurring first in the specification is preferred
- need for trailing context
  - sometimes reserved words must occur in groups
    - if any word is missing from the group, the words are identifiers instead
  - the right-context operator / provides for this, e.g.:

        IF/.*THEN  { return IF; }

    (input after the / must be matched, but is not consumed)
- coupling between the parser and lexical analyser is sometimes needed
  - in C, typedef'd names are reserved words (not identifiers)!
  - the symbol table provides a place in which this communication can take place
lex complications

- modal treatment of characters, e.g., C strings
  - C compilers warn of string constants that span lines
  - the interpretation of \n changes within a string constant
- two ways to handle this; first:
  - let the action consume input characters, storing them in a buffer
  - explicitly check for un-escaped \n
  - tedious and error-prone
- or, second: temporarily put lex into a mode where \n becomes illegal

        \"       { BEGIN str; }
        <str>\n  { error("end of line in string"); }
        <str>\"  { BEGIN 0; return STRING; }
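to see why the first (manual) approach is tedious, here is a sketch of it: scan a string constant starting just after the opening quote, copy its characters into a buffer, and reject an un-escaped newline. The function name and interface are assumptions for illustration; it returns the number of input characters consumed (including the closing quote) or -1 on error:

```c
int scan_string(const char *input, char *buf, int bufsize)
{
    int i = 0, n = 0;
    while (input[i] != '\0' && input[i] != '"') {
        if (input[i] == '\n')
            return -1;                      /* end of line in string */
        if (input[i] == '\\' && input[i + 1] != '\0') {
            /* keep escape sequences (including \n) intact: copy the
               backslash, then fall through to copy the escaped char */
            if (n + 1 >= bufsize - 1)
                return -1;                  /* buffer overflow */
            buf[n++] = input[i++];
        }
        if (n >= bufsize - 1)
            return -1;                      /* buffer overflow */
        buf[n++] = input[i++];
    }
    if (input[i] != '"')
        return -1;                          /* unterminated string */
    buf[n] = '\0';
    return i + 1;                           /* closing quote consumed too */
}
```

every special case (escapes, newlines, buffer limits, unterminated input) must be handled by hand; the <str> mode shown above lets the lex-generated DFA do all of this instead.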
homework and next week

homework:

- read the slides
- learn the vocabulary
- practice using lex
  - download the examples from the course website
  - compile and run them

next week:

- we now have tokens, so...
- let's turn them into a parse tree
- recursive-descent parsing
glossary

action  user-supplied code executed when a sequence of characters has been recognised. In lexical analysis, actions typically construct and return a token. In syntactic analysis, actions typically construct a parse tree node.

identity  a property of an item that allows it to be identified uniquely and compared for equality. The literal value of a scalar quantity, or the memory address of an aggregate structure, typically serves as its identity. Two such items can be compared in a single operation (without having to compare the contents of the aggregate structure, for example).

lex  a program that generates scanners from a high-level description based on regular expressions.

mode  in lex, a state in which a different set of rules and patterns is temporarily in effect. Scanning a string, for example, might put the scanner into a mode where newline characters are not allowed.
reserved word  a token that is reserved by the programming language. For example, in C the tokens for, while and if obey the rules for identifiers but cannot be used as identifiers, since they are reserved words that give structure to the program. (In C, identifiers that have been defined as type names with typedef are treated as reserved words.)

scanner  another name for a lexical analyser: a program that converts a sequence of symbols (typically text characters) into tokens that represent the semantic quantities (identifiers, numbers, punctuation symbols) of the language being parsed.

scanner generator  a program that generates a scanner from a high-level description, often written as a set of regular expressions that describe the tokens to be produced when the generated scanner is run.
scanning  the process of converting a sequence of symbols (typically text characters) into tokens that represent the semantic quantities (identifiers, numbers, punctuation symbols) of the language being parsed.

semantic type  the category to which a token belongs, often associated with a single terminal symbol (parentheses, arithmetic operators, statement terminators, etc.) or a class of related terminal symbols that have identical semantic behaviour (identifiers, literals, etc.).

semantic value  the actual value of a token, implied by the text that matched the token during scanning. For example, a token whose semantic type is integer might have the semantic value 37, and a token of type identifier might have the semantic value tempvar.
symbol  an object representing a name (such as an identifier) whose identity is guaranteed to be unique for any given value. Symbols can be compared by identity (their memory addresses, for example) instead of having to perform a more expensive comparison of the characters in the associated name. For example, every occurrence of the identifier xyz in a program would typically be scanned as the same, unique symbol object.

table lookup  finding a value by indexing a table. The lookup is performed in constant time: no search needs to be performed.

token  an object or value representing a single semantic item in a language. For example, identifiers, integers and the various arithmetic operator symbols of a language are typically represented as single tokens (even though they may be written using more than one character). A token is often made from two properties: the type of the token (indicating the role it plays in the language, such as integer, identifier, multiplication operator, etc.) and its value (if any, such as the numeric value of an integer or the symbol associated with an identifier).