CSC 467 Lecture 3: Regular Expressions

Recall

How we build a lexer by hand:
o Use fgetc/mmap to read input
o Use a big switch to match patterns

Homework exercise:

    static TokenKind identifier( Token *token )
    {  /* exercise for you -- did you do it? */
       int len;
       char *start = cursor - 1;   /* gettoken already consumed the first letter */
       while( isletter(*cursor) || isdigit(*cursor) )
          cursor++;
       token->kind = KIND_TOKEN_IDENTIFIER;
       len = cursor - start;
       token->u.stringval = malloc( len + 1 );
       strncpy( token->u.stringval, start, len );
       token->u.stringval[len] = '\0';
       return token->kind;
    }

    TokenKind gettoken( Token *token )
    {
       for( ; ; ) {
          c = *cursor++;
          switch( c ) {
          ...
          case 'a': case 'A':   /* ... and so on for each letter ... */
             if( cursor[-1] == 'f' && cursor[0] == 'o' && cursor[1] == 'r'
                 && isblank(cursor[2]) ) {
                cursor += 2;
                return KIND_TOKEN_FOR;
             }
             return identifier( token );
          ...
          }
       }
    }

Today

How can we build a lexer systematically?
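The fragments above can be assembled into a self-contained sketch. The names (cursor, Token, KIND_TOKEN_FOR) follow the lecture; the Token layout and the token-kind values are assumptions, and the standard isalpha/isdigit stand in for the lecture's isletter helper.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum { KIND_TOKEN_EOF, KIND_TOKEN_FOR, KIND_TOKEN_IDENTIFIER };

typedef struct {
    int kind;
    union { char *stringval; } u;
} Token;

static char *cursor;   /* next unread character of the input buffer */

static int identifier(Token *token)
{
    char *start = cursor - 1;   /* gettoken consumed the first letter */
    while (isalpha((unsigned char)*cursor) || isdigit((unsigned char)*cursor))
        cursor++;
    int len = (int)(cursor - start);
    token->kind = KIND_TOKEN_IDENTIFIER;
    token->u.stringval = malloc(len + 1);
    strncpy(token->u.stringval, start, len);
    token->u.stringval[len] = '\0';
    return token->kind;
}

static int gettoken(Token *token)
{
    for (;;) {
        int c = *cursor++;
        switch (c) {
        case '\0':
            cursor--;                    /* stay at end of input */
            return token->kind = KIND_TOKEN_EOF;
        case 'f':                        /* keyword "for" followed by a blank? */
            if (cursor[0] == 'o' && cursor[1] == 'r' &&
                isblank((unsigned char)cursor[2])) {
                cursor += 2;
                return token->kind = KIND_TOKEN_FOR;
            }
            return identifier(token);
        default:
            if (isalpha((unsigned char)c))
                return identifier(token);
            continue;                    /* skip whitespace and anything else */
        }
    }
}
```

On the input "for for8 ", this returns KIND_TOKEN_FOR, then KIND_TOKEN_IDENTIFIER with lexeme "for8", then KIND_TOKEN_EOF.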
Start with how to describe token patterns.

Regular Expressions

The notation we use to precisely capture all the variations that a given category of token may take is called "regular expressions" (or, less formally, "patterns"; the word "pattern" is really vague and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. In order to even talk about "strings" you have to first define an alphabet: the set of characters which can appear.

1. Epsilon: ε is a regular expression denoting the set { "" } containing only the empty string.
2. Symbol: any letter a in the alphabet is also a regular expression, denoting the set { "a" } containing the one-letter string consisting of that letter.
3. Alternation: for regular expressions r and s, r | s is a regular expression denoting the union of (the sets denoted by) r and s.
4. Concatenation: for regular expressions r and s, r s is a regular expression denoting the set of strings consisting of a member of r followed by a member of s.
5. Repetition: for regular expression r, r* is a regular expression denoting the set of strings consisting of zero or more occurrences of r.

Notational Sugar

Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions:

o You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation).
o For regular expression r, r+ is a regular expression denoting the set of strings consisting of one or more occurrences of r. Equivalent to rr*.
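As a concrete reading of the definitions, here is a hand-coded recognizer (a hypothetical example, not from the lecture) for the single regular expression (a|b)c* over the alphabet {a, b, c}: alternation picks one branch, concatenation sequences the parts, and * loops zero or more times.

```c
#include <stdbool.h>

/* Recognize the language of (a|b)c*:
 * one 'a' or 'b' (alternation), then zero or more 'c' (repetition),
 * the two parts joined by concatenation. */
static bool matches(const char *s)
{
    if (*s != 'a' && *s != 'b')   /* a|b : must take exactly one branch */
        return false;
    s++;
    while (*s == 'c')             /* c*  : zero or more repetitions */
        s++;
    return *s == '\0';            /* the whole string must be consumed */
}
```

So matches("a") and matches("bccc") hold, while matches("") and matches("ca") do not.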
o For regular expression r, r? is a regular expression denoting the set of strings consisting of zero or one occurrence of r. Equivalent to r | ε.
o The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z. [^abc] is short for: any character other than a, b, or c.

Example

    for         (keyword) for
    letter      [a-zA-Z]
    digit       [0-9]
    identifier  letter (letter|digit)*
    sign        + | - | ε
    integer     sign (0 | [1-9] digit*)
    decimal     integer . digit*
    real        (integer | decimal) E sign digit*

There is some ambiguity, though: if the input includes the characters for8, then the first rule (for the for-keyword) matches 3 characters (for), while the fourth rule (for identifier) can match 1, 2, 3, or 4 characters, the longest being for8. To resolve this type of ambiguity, when there is a choice of rules, scanner generators choose the one that matches the maximum number of characters. In this case, the chosen rule is the one for identifier, which matches 4 characters (for8). This disambiguation rule is called the longest match rule. If more than one rule matches the same maximum number of characters, the rule listed first is chosen. This is the rule priority disambiguation rule. For example, the lexical word for is taken as a for-keyword even though it uses the same number of characters as an identifier.

lex(1) and flex(1)

These programs take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler. The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.
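The two disambiguation rules can be sketched directly in C. This is a hypothetical illustration, not generated-scanner code: each rule reports how many characters it matches at the current position, and the scanner keeps the longest match, breaking ties in favor of the rule listed first.

```c
#include <ctype.h>
#include <string.h>

static int match_for(const char *s)        /* rule 1: the keyword "for" */
{
    return strncmp(s, "for", 3) == 0 ? 3 : 0;
}

static int match_identifier(const char *s) /* rule 2: letter (letter|digit)* */
{
    int n = 0;
    if (isalpha((unsigned char)s[0]))
        for (n = 1; isalnum((unsigned char)s[n]); n++)
            ;
    return n;
}

/* Return the category of the token starting at s. */
static const char *scan(const char *s)
{
    int (*rules[])(const char *) = { match_for, match_identifier };
    const char *names[] = { "FOR", "IDENTIFIER" };
    int best = -1, bestlen = 0;
    for (int i = 0; i < 2; i++) {
        int len = rules[i](s);
        if (len > bestlen) {   /* strictly longer wins (longest match);   */
            best = i;          /* a tie keeps the earlier rule (priority) */
            bestlen = len;
        }
    }
    return best >= 0 ? names[best] : "ERROR";
}
```

On "for8" the identifier rule matches 4 characters versus 3, so scan returns IDENTIFIER; on "for " both rules match 3 characters, and rule priority gives FOR.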
    FILE *yyin;     /* set this variable prior to calling yylex() */
    int yylex();    /* call this function once for each token */
    char yytext[];  /* yylex() writes the token's lexeme to an array */
                    /* note: with flex, I believe extern declarations
                       must read: extern char *yytext; */
    int yywrap();   /* called by lex when it hits end-of-file; see below */
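A typical driver uses this interface as follows. Since the generated lex.yy.c is not shown here, a canned stub stands in for yylex() so the sketch compiles on its own; the token codes and the script of lexemes are invented for the example.

```c
#include <stdio.h>
#include <string.h>

/* Normally provided by the generated lex.yy.c; stubbed for illustration. */
FILE *yyin;
char yytext[256];

enum { IDENTIFIER = 257, ASTERISK = 258, PERIOD = 259 };

static const char *script[] = { "count", "*", ".", NULL };
static int pos = 0;

int yylex(void)                    /* stub: replays a canned token stream */
{
    if (script[pos] == NULL)
        return 0;                  /* the real yylex() returns 0 at EOF */
    strcpy(yytext, script[pos]);
    const char *lexeme = script[pos++];
    if (strcmp(lexeme, "*") == 0) return ASTERISK;
    if (strcmp(lexeme, ".") == 0) return PERIOD;
    return IDENTIFIER;
}

/* The driver loop: set yyin, then call yylex() once per token until 0. */
static int count_tokens(void)
{
    int cat, n = 0;
    yyin = stdin;                  /* set before the first yylex() call */
    while ((cat = yylex()) != 0)
        n++;
    return n;
}
```

Your parser plays the role of count_tokens(): it calls yylex() repeatedly and inspects the returned category and the global yytext.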
The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:

    header
    %%
    body
    %%
    helper functions

The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.

    letter [a-zA-Z]
    digit  [0-9]
    ident  {letter}({letter}|{digit})*

The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

    " "     { /* no-op, discard whitespace */ }
    {ident} { return IDENTIFIER; }
    "*"     { return ASTERISK; }
    "."     { return PERIOD; }

You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals. One helper function you have to write is yywrap(), which is called when lex hits end of file. If you just want lex to quit, have yywrap() return 1. If your yywrap() switches yyin to a different file and you want lex to continue processing, have yywrap() return 0. The lex and flex libraries (-ll or -lfl) have a default yywrap() function which returns 1, and flex has the directive %option noyywrap which allows you to skip writing this function.
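Putting the three sections together, a minimal .l file might look like this. It is a hypothetical example: the token codes are made up, and a real compiler would share them with the parser via a header.

```lex
%{
/* header section: C fragments between %{ and %} are copied verbatim */
#define IDENTIFIER 257
#define ASTERISK   258
#define PERIOD     259
%}
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*

%%
[ \t\n]+   { /* discard whitespace */ }
{ident}    { return IDENTIFIER; }
"*"        { return ASTERISK; }
"."        { return PERIOD; }
.          { fprintf(stderr, "illegal character: %s\n", yytext); }
%%

int yywrap(void) { return 1; }   /* quit at end of file */
```

Running flex on this file produces lex.yy.c, which you compile and link with the rest of the compiler.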
A Short Comment on Lexing C Reals

C float and double constants have to have at least one digit, either before or after the required decimal point. This is a pain:

    ([0-9]+"."[0-9]*|[0-9]*"."[0-9]+) ...

You might almost be happier if you wrote

    [0-9]*"."[0-9]*   { return (strcmp(yytext,".")) ? REAL : PERIOD; }
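The trick works because the sloppier pattern [0-9]*"."[0-9]* also matches a bare ".", so the semantic action must sort the two cases apart by inspecting yytext. A small sketch of that check, with classify() as a hypothetical helper and invented token codes:

```c
#include <string.h>

enum { REAL = 300, PERIOD = 301 };

/* After [0-9]*"."[0-9]* matches, yytext holds the lexeme.
 * A lone "." is the PERIOD operator; anything longer is a REAL.
 * strcmp returns nonzero (true) exactly when yytext is not ".". */
static int classify(const char *yytext)
{
    return strcmp(yytext, ".") ? REAL : PERIOD;
}
```

So classify(".") yields PERIOD while classify("3.14") and classify(".5") yield REAL.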
You-all know C's ternary e1 ? e2 : e3 operator, don't ya? It's an if-then-else expression, very slick.

Lex extended regular expressions

Lex further extends the regular expressions with several helpful operators. Lex's regular expressions include:

    c       normal characters mean themselves
    \c      backslash escapes remove the meaning from most operator
            characters; inside character sets and quotes, backslash
            performs C-style escapes
    "s"     double quotes mean to match the C string given as itself;
            this is particularly useful for multi-byte operators and may
            be more readable than using backslash multiple times
    [s]     this character set operator matches any one character among
            those in s
    [^s]    a negated set matches any one character not among those in s
    .       the dot operator matches any one character except newline:
            [^\n]
    r*      match r 0 or more times
    r+      match r 1 or more times
    r?      match r 0 or 1 time
    r{m,n}  match r between m and n times
    r1r2    concatenation: match r1 followed by r2
    r1|r2   alternation: match r1 or r2
    (r)     parentheses specify precedence but do not match anything
    r1/r2   lookahead: match r1 when r2 follows, without consuming r2
    ^r      match r only when it occurs at the beginning of a line
    r$      match r only when it occurs at the end of a line