Chapter 2 Lexical Analysis

Lexical analysis, or scanning, is the process of reading the stream of characters that make up the source program from left to right and grouping them into tokens. The lexical analyzer takes a source program as input and produces a stream of tokens as output. The particular character sequences it recognizes as instances of tokens are called lexemes. Each token is then passed to the next phase of the compiler, i.e. syntax analysis.

It is common for a lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, the lexical analyzer may read information about the kind of identifier from the symbol table to help it determine the appropriate token to pass to the parser. Figure 2.1 shows the role of a lexical analyzer.

Figure 2.1: Role of Lexical Analyzer

2.1 Constituents of Lexical Analysis

Token: A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or
a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. Some examples of tokens in C are: keywords (e.g. int, while), identifiers (e.g. rate, total), constants (e.g. 10, 2.5), strings (e.g. "total", "hello"), special symbols (e.g. ( ), { }), and operators (e.g. +, /, -, *).

Pattern: A pattern is a description of the form that the lexemes of a token may take. When the token is a keyword, the pattern is just the sequence of characters that forms the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

Lexeme: A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. Table 2.1 shows examples of tokens, patterns and lexemes in the C language.

Table 2.1: Example of Token, Pattern, Lexemes

Token     Lexeme            Pattern
ID        x  y  n0          letter followed by letters and digits
NUM       -123  1.456e-5    any numeric constant
IF        if                if
LPAREN    (                 (
RPAREN    )                 )
LITERAL   "Hello"           any string of characters

For example, in the C statement printf("Final = %d", Number); both printf and Number are lexemes matching the pattern for token ID, and "Final = %d" is a lexeme matching LITERAL. The characters ( and ) match the tokens LPAREN and RPAREN respectively.

When more than one lexeme matches a pattern, the lexical analyzer must provide additional information about the particular lexeme that matched. The lexical analyzer therefore returns to the subsequent compiler phases not only a token name, but also an attribute value that describes the lexeme represented by the token. The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
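The pairing of a token name with an optional attribute value can be sketched in C as follows. This is only an illustration: the type and function names (TokenName, Token, format_token, NO_ATTRIBUTE) are hypothetical, and using an integer as the attribute (for instance an index into the symbol table) is an assumption rather than something the text prescribes.

```c
#include <stdio.h>

/* Token names from Table 2.1; the identifiers themselves are illustrative. */
typedef enum { TOK_ID, TOK_NUM, TOK_IF, TOK_LPAREN, TOK_RPAREN, TOK_LITERAL } TokenName;

#define NO_ATTRIBUTE (-1)   /* assumed marker for "this token carries no attribute" */

typedef struct {
    TokenName name;   /* abstract symbol that the parser processes */
    int attribute;    /* optional attribute value, e.g. a symbol-table index */
} Token;

/* Printable text for a token name. */
const char *token_name_text(TokenName name) {
    static const char *const text[] = { "ID", "NUM", "IF", "LPAREN", "RPAREN", "LITERAL" };
    return text[name];
}

/* Write the token as <NAME> or <NAME,attribute>; returns the length written. */
int format_token(Token t, char *buf, size_t size) {
    if (t.attribute == NO_ATTRIBUTE)
        return snprintf(buf, size, "<%s>", token_name_text(t.name));
    return snprintf(buf, size, "<%s,%d>", token_name_text(t.name), t.attribute);
}
```

With these definitions, a Token holding TOK_ID and attribute 1 is formatted as <ID,1>, while a Token holding TOK_LPAREN and NO_ATTRIBUTE is formatted as <LPAREN>.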
For the C statement printf("Final = %d", Number); the tokens returned would be:

<ID,1><LPAREN><LITERAL><,><ID,2><RPAREN><;>

Here more than one identifier is found, so a numeric attribute value is attached to each ID token to tell them apart.

2.2 Input Buffering

There are three general approaches to the implementation of a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a specification based on regular expressions. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

Because a large number of characters must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the overhead required to process a single input character. Two important buffering techniques are described below.

2.2.1 Buffer Pairs

In this technique the input is held in two buffers of equal size that are reloaded alternately, and two pointers to the input are maintained. The first pointer, Lexeme Begin, marks the beginning of the current lexeme, whose extent we are attempting to determine, while the second pointer, Forward, scans ahead until a pattern match is found. Once the next lexeme is determined, Forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, Lexeme Begin is set to the character immediately after the lexeme just found.

2.2.2 Sentinels

If we use buffer pairs, we must check each time we advance Forward that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character was read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character EOF. Note that EOF retains its use as a marker for the end of the entire input.
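The combined test can be sketched in C roughly as follows. Several things here are assumptions made for illustration: the names (Scanner, reload, next_char), the tiny buffer size, the use of an in-memory string in place of a source file, and the use of the NUL character as the sentinel in place of EOF (which in C is an int and does not fit in a char buffer). The Lexeme Begin pointer is left out to keep the sketch short.

```c
#include <stddef.h>
#include <string.h>

#define HALF 8            /* size of each buffer; kept tiny so reloads actually happen */
#define SENTINEL '\0'     /* assumption: NUL never occurs in the source text */

typedef struct {
    const char *src;              /* unread input (stand-in for a source file) */
    char buf[2 * (HALF + 1)];     /* two buffers, each followed by a sentinel slot */
    char *forward;                /* the Forward pointer */
} Scanner;

/* Copy up to HALF characters into one buffer and put the sentinel after them. */
static void reload(Scanner *s, char *half) {
    size_t n = strlen(s->src);
    if (n > HALF) n = HALF;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;           /* at half[HALF] when full, earlier at end of input */
    s->src += n;
}

void scanner_init(Scanner *s, const char *text) {
    s->src = text;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next input character, or SENTINEL once the input is exhausted. */
char next_char(Scanner *s) {
    char *first_end = s->buf + HALF;            /* sentinel slot of buffer 1 */
    char *second_end = s->buf + 2 * HALF + 1;   /* sentinel slot of buffer 2 */
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {      /* common case: a single test per character */
            s->forward++;
            return c;
        }
        if (s->forward == first_end) {          /* end of buffer 1: load buffer 2 */
            reload(s, s->buf + HALF + 1);
            s->forward = s->buf + HALF + 1;
        } else if (s->forward == second_end) {  /* end of buffer 2: load buffer 1 */
            reload(s, s->buf);
            s->forward = s->buf;
        } else {
            return SENTINEL;      /* sentinel inside a buffer: end of input */
        }
    }
}
```

For ordinary characters, next_char performs only the comparison against the sentinel; the buffer-boundary checks run only when a sentinel is actually seen.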
Any EOF that appears other than at the end of a buffer means that the input is at an end.

2.3 Token Specification

The patterns corresponding to a token are generally specified using a compact notation known as regular expressions. Regular expressions over a language are built by combining members of its alphabet. A regular expression r corresponds to a set of strings L(r), where L(r) is called
a regular set or a regular language and may be infinite. A regular expression is defined as follows:

A basic regular expression a denotes the set {a}, where a ∈ Σ; that is, L(a) = {a}.
The regular expression ɛ denotes the set {ɛ}. Technically, the regular expression ɛ is different from the string ɛ. Here ɛ represents the empty string.

If r and s are two regular expressions denoting the sets L(r) and L(s), then the following rules apply:
R1. r | s is a regular expression denoting the union set: L(r) ∪ L(s)
R2. rs is a regular expression denoting the concatenation set: L(r)L(s)
R3. r* is a regular expression denoting the Kleene closure set: L(r)*
R4. (r) is a regular expression denoting the set L(r)

Following are some examples of regular expressions:
0 | 1 denotes the set {0, 1} as per rule 1.
0* denotes the set {ɛ, 0, 00, 000, 0000, ...} as per rule 3.
(0 | 1)(0 | 1) denotes the set {00, 01, 10, 11} as per rules 1 and 2.
(0 | 1)* denotes the set {ɛ, 0, 1, 00, 01, 10, 11, 000, 001, ...} as per rules 1 and 3.
0 | 0*1 denotes the set {0, 1, 01, 001, 0001, ...} as per rules 1, 2 and 3.

2.3.1 Regular Definition

We may assign a name to a regular expression so that the name can be reused in other (more complex) regular expressions and so that longer regular expressions become more readable. Suppose the following regular definitions are given:

digit = [0-9], which represents any digit in the range 0 through 9.
letter = [A-Za-z], which represents any letter from capital A through Z or small a through z.
eol = [\n]
neol = [^\n]

We can use these regular definitions to write complex regular expressions, for example,
Integer_Literal = digit+
Fixed_Point_Literal = digit+ "." digit+
Floating_Point_Literal = digit+ "." digit+ (e | E)(+ | -)? digit+
Identifier = letter (letter | digit)*

2.4 Token Recognition

The previous section described how the tokens of a language are specified using a compact notation called regular expressions. This section explains how to construct recognizers that can identify the tokens occurring in the input stream. These recognizers are known as finite automata. A Finite Automaton (FA) consists of:

A finite set of states
A set of transitions (or moves) between states; the transitions are labeled by characters from the alphabet
A special start state
A set of final or accepting states

A finite automaton representing Identifier = letter (letter | digit)* is shown in Figure 2.2.

Figure 2.2: A finite automaton for Identifier

2.4.1 Deterministic Finite Automata (DFA)

A Deterministic Finite Automaton (DFA) is a 5-tuple M = (Q, Σ, δ, S, F) consisting of:
1. A finite set of states Q
2. A finite set of input symbols Σ
3. A transition function δ : Q × Σ → Q
4. A start state S ∈ Q
5. A set of accepting states F ⊆ Q

A DFA takes an input string w over the alphabet Σ and either accepts or rejects the string. Identifying acceptance with the value 1 and rejection with 0, one can think of a DFA as a machine that takes a string w as input and outputs a single bit b ∈ {0, 1}.

A DFA can be represented by a transition table T, which is indexed by state s and input character c. T[s][c] is the next state to visit from state s if the input character is c. T can also be described as a transition function T : Q × Σ → Q that maps the pair (s, c) to next_s. The DFA and transition table for a C comment are shown in Figure 2.3 and Table 2.2. Blank entries in the table represent an error state. A full transition table would contain one column for each character, which may waste space, so characters that are treated identically by the DFA are combined into character classes.

Figure 2.3: DFA for C Comments

Table 2.2: Transition Table for C Comments

State   /   *   other
1       2
2           3
3       3   4   3
4       5   4   3
5
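A table-driven recognizer built on this idea can be sketched in C as follows. The sketch assumes that state 1 is the start state and state 5 the only accepting state, that state 3 (inside the comment body) stays in state 3 on any character other than *, and that an extra state 0 stands for the blank (error) entries. The names T, char_class and is_c_comment are illustrative.

```c
enum { ERR = 0, NUM_STATES = 6 };                 /* states 1..5, plus 0 for error */
enum { C_SLASH, C_STAR, C_OTHER, NUM_CLASSES };   /* character classes (table columns) */

/* Transition table: one row per state, one column per character class. */
static const int T[NUM_STATES][NUM_CLASSES] = {
    /*           '/'   '*'  other */
    /* 0 */   { ERR, ERR, ERR },
    /* 1 */   {   2, ERR, ERR },
    /* 2 */   { ERR,   3, ERR },
    /* 3 */   {   3,   4,   3 },
    /* 4 */   {   5,   4,   3 },
    /* 5 */   { ERR, ERR, ERR },
};

/* Map a character to its class so the table needs only three columns. */
static int char_class(char c) {
    if (c == '/') return C_SLASH;
    if (c == '*') return C_STAR;
    return C_OTHER;
}

/* Return 1 if the whole string s is exactly one C comment, 0 otherwise. */
int is_c_comment(const char *s) {
    int state = 1;                                /* start state */
    for (; *s != '\0' && state != ERR; s++)
        state = T[state][char_class(*s)];
    return state == 5;                            /* accepting state */
}
```

Because the loop only indexes the table, changing the language recognized means changing the table entries rather than the driver code.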
2.4.2 Non-Deterministic Finite Automata (NFA)

An NFA is a 5-tuple M = (Q, Σ, δ, S, F) consisting of:
1. A finite set of states Q
2. A finite set of input symbols Σ
3. A transition function δ : Q × (Σ ∪ {ɛ}) → P(Q)
4. A start state S ∈ Q
5. A set of accepting states F ⊆ Q

The only difference between a DFA and an NFA is in the transition function δ; the rest of the definition is exactly the same as that of a DFA. We proceed to define its computations using the same style as for DFAs. An NFA is similar to a DFA except that multiple transitions labeled by the same character are allowed from the same state, and ɛ-transitions are allowed. ɛ-transitions are spontaneous: they occur without consuming any input character. Figure 2.4 and Figure 2.5 show a DFA and an NFA for relational operators.

Figure 2.4: DFA of Relational Operators

Figure 2.5: NFA for Relational Operators
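One common way to run an NFA is to track the set of states it could be in, taking the ɛ-closure after every move. The C sketch below illustrates this on a small hypothetical NFA for the operators <, <=, > and >=; it is not a transcription of Figure 2.5, and the state numbering, the bit-set representation and all function names are assumptions made for the example.

```c
#include <stddef.h>

#define NSTATES 7
typedef unsigned int StateSet;      /* bit i set means state i is in the set */
#define BIT(s) (1u << (s))

/* Hypothetical NFA: 0 -eps-> 1, 0 -eps-> 4,
   1 -'<'-> 2, 2 -'='-> 3, 4 -'>'-> 5, 5 -'='-> 6. */
static const StateSet ACCEPTING = BIT(2) | BIT(3) | BIT(5) | BIT(6);

/* States reachable from s by a single eps-transition. */
static StateSet eps_moves(int s) {
    return s == 0 ? (BIT(1) | BIT(4)) : 0;
}

/* States reachable from s by consuming character c. */
static StateSet char_moves(int s, char c) {
    switch (s) {
    case 1: return c == '<' ? BIT(2) : 0;
    case 2: return c == '=' ? BIT(3) : 0;
    case 4: return c == '>' ? BIT(5) : 0;
    case 5: return c == '=' ? BIT(6) : 0;
    default: return 0;
    }
}

/* Add every state reachable through eps-transitions until nothing changes. */
static StateSet eps_closure(StateSet set) {
    StateSet prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & BIT(s)) set |= eps_moves(s);
    } while (set != prev);
    return set;
}

/* Return 1 if the NFA accepts the whole input string, 0 otherwise. */
int nfa_accepts(const char *input) {
    StateSet current = eps_closure(BIT(0));       /* start state plus its closure */
    for (size_t i = 0; input[i] != '\0'; i++) {
        StateSet next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & BIT(s)) next |= char_moves(s, input[i]);
        current = eps_closure(next);
    }
    return (current & ACCEPTING) != 0;
}
```

The ɛ-transitions out of state 0 consume no input, which is why the simulation starts in the set {0, 1, 4} rather than in state 0 alone.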
2.5 Lexical Analyzer Generator

A lexical analyzer generator, or scanner generator, generates lexical analyzers that can be used to scan a file. Lex and Flex are two of the most popular scanner generators available on UNIX and Linux platforms. They take as input a specification written as regular expressions and generate C code that performs lexical analysis of the file supplied as input; that is, they generate a lexical analyzer. Figure 2.6 shows the working of lex/flex and Figure 2.7 gives a general template for writing lex/flex specifications.

Figure 2.6: Working of Lex/Flex

Figure 2.7: Lex/Flex Specification Template

2.5.1 Definition Section

This section contains the header files to include, macros, and basic declarations of variables, functions, keywords, special patterns, etc. It is copied to the generated C file. We include the following code in our definition section:

    #include <stdio.h>
    int vowels = 0;
    int cons = 0;
2.5.2 Rule Section

This section pairs regular expression patterns with C statements. When the scanner matches text in the input file against a declared pattern, it executes the code associated with that pattern. Using the variables declared in the definition section, we define the following rules:

    [aeiouAEIOU] {vowels++;}

The above rule means that whenever a vowel is read, the vowel count is incremented.

    [a-zA-Z] {cons++;}

The above rule means that whenever a consonant is read, the consonant count is incremented. A vowel matches both patterns, but because the vowel rule is listed first, this rule fires only for the remaining letters.

2.5.3 User Subroutines

This section contains the main function, the definitions of functions declared in the definition section, and other relevant C code. These statements are copied directly to the generated source file. The rules written in the rule section are executed when main calls yylex().

    int main(void)
    {
        printf("Enter the string.. at end press ^d\n");
        yylex();
        printf("No of vowels=%d\nNo of consonants=%d\n", vowels, cons);
        return 0;
    }

When lex compiles the input specification, it generates the C file lex.yy.c, which contains the routine yylex(). This routine reads the input and tries to match it against the token patterns specified in the rules section. On a match, the associated action is executed. If more than one pattern matches, the action associated with the pattern that matches the most text (including trailing context) is executed. If two or more patterns still match the same amount of text, the action associated with the pattern listed first in the specification file is executed. If no match is found, the default action is executed, which copies the unmatched character to the output. The input text (lexeme) associated with the recognized token is placed in the global variable yytext. A detailed description of using the lex/flex compiler is given in Appendix A.
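The two tie-breaking rules (longest match first, then the rule listed first) can be illustrated with a small hand-written C sketch. This is not how lex is implemented internally, since lex compiles the patterns into an automaton; the three rules, the Matcher type and the function names are hypothetical and chosen only to show the selection policy.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical rule reports how many leading characters of s it matches
   (0 means no match). */
typedef size_t (*Matcher)(const char *s);

/* Keyword rule: the literal text "if". */
size_t match_if(const char *s) {
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Identifier rule: letter (letter | digit)* */
size_t match_ident(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

/* Integer rule: digit+ */
size_t match_int(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

enum { RULE_IF, RULE_IDENT, RULE_INT, RULE_COUNT, RULE_NONE = -1 };

/* Rules in specification order: earlier entries win ties. */
static const Matcher rules[RULE_COUNT] = { match_if, match_ident, match_int };

/* Pick the rule for the text at s; the matched length is stored in *len. */
int select_rule(const char *s, size_t *len) {
    int best = RULE_NONE;
    *len = 0;
    for (int i = 0; i < RULE_COUNT; i++) {
        size_t n = rules[i](s);
        if (n > *len) {       /* strictly longer only, so the first-listed rule keeps a tie */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On the input "if (x)" both the keyword and identifier rules match two characters, so the keyword rule wins because it is listed first; on "ifx = 1" the identifier rule matches three characters and wins as the longer match.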
Example: To count the number of vowels and consonants in a given string.

    %{
    #include <stdio.h>
    int vowels = 0;
    int cons = 0;
    %}
    %%
    [aeiouAEIOU]  vowels++;
    [a-zA-Z]      cons++;
    %%
    int yywrap(void)
    {
        return 1;
    }
    int main(void)
    {
        printf("Enter the string.. at end press ^d\n");
        yylex();
        printf("No of vowels=%d\nNo of consonants=%d\n", vowels, cons);
        return 0;
    }

Using the approach described in this chapter, a lexical analyzer can be designed to perform a specific lexical analysis task.