PRINCIPLES OF COMPILER DESIGN
UNIT II LEXICAL ANALYSIS

2.1 Lexical Analysis - The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a token for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.

When called by the parser, the lexical analyzer reads characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes. One such task is correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions. If the source program uses a macro-preprocessor, the expansion of macros may also be performed by the lexical analyzer.

Sometimes, lexical analyzers are divided into a cascade of two processes, a scanner followed by the lexical analyzer proper:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, which produces the sequence of tokens from the output of the scanner.
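The interaction just described is often packaged behind a single routine that the parser calls in a loop. The following C sketch is purely illustrative; the names Token, getNextToken, and parse are assumptions, and the lexer body is stubbed out.

#include <stdio.h>

/* Illustrative token kinds; a real compiler defines many more. */
typedef enum { TOK_ID, TOK_NUM, TOK_IF, TOK_EOF } TokenName;

typedef struct {
    TokenName name;      /* abstract symbol the parser works with */
    int       attribute; /* e.g. a symbol-table index for TOK_ID  */
} Token;

/* Stub: a real lexer reads characters until the next lexeme is found. */
static Token getNextToken(void) {
    Token t = { TOK_EOF, 0 };
    return t;
}

/* The parser drives the lexer, pulling one token at a time. */
static void parse(void) {
    for (Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken()) {
        /* ... syntax analysis consumes t here ... */
    }
}

int main(void) { parse(); return 0; }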

Lexical Analysis Versus Parsing

1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

2.1.1 Tokens, Patterns, and Lexemes

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface, and we will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse. A token has at most one associated attribute, although this attribute may have a structure that combines several pieces of information. The most important example is the token id, where we need to associate with the token a great deal of information.
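Since a token carries at most one attribute, a common representation is a tagged union. The following C sketch is hedged: the SymtabEntry type and the field names are assumptions for illustration.

typedef struct SymtabEntry SymtabEntry;   /* opaque symbol-table entry */

typedef enum { ID, NUM, RELOP } TokenName;

typedef struct {
    TokenName name;
    union {                    /* at most one attribute per token...    */
        SymtabEntry *entry;    /* ID: pointer to symbol-table entry     */
        long         value;    /* NUM: numeric value found in the input */
        int          relop;    /* RELOP: LT, LE, EQ, NE, GT, or GE      */
    } attr;                    /* ...though it may combine information  */
} Token;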

Normally, information about an identifier - e.g., its lexeme, its type, and the location at which it is first found - is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.

2.2 Input Buffering

This section presents a two-buffer scheme that handles large lookaheads safely, and then an improvement involving "sentinels" that saves time checking for the ends of buffers.

Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded. Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.

Two pointers to the input are maintained:
1. Pointer lexemebegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.

Sentinels

The sentinel is a special character that cannot be part of the source program; a natural choice is the character eof. It retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end. The algorithm for advancing forward is sketched below. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.
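A hedged C sketch of the two-buffer scheme with sentinels described above; the buffer layout, the 0 byte standing in for eof, and the stubbed fill routines are illustrative assumptions.

#include <stdio.h>

#define N 4096                       /* buffer size, e.g. one disk block    */
#define EOF_CHAR '\0'                /* sentinel; assumed absent from source */

static char buf[2 * (N + 1)];        /* two buffers, each ending in a sentinel */
static char *forward = buf;

/* Stubs: each would read up to N bytes and write a sentinel after them. */
static void reloadFirstBuffer(void)  { /* read into buf[0..N-1]  */ }
static void reloadSecondBuffer(void) { /* read into buf[N+1..2N] */ }

/* Advance forward one character. In the common case this is a single
   test per character; buffer reloads happen only on a sentinel hit.  */
char advance(void) {
    char c = *forward++;
    if (c == EOF_CHAR) {
        if (forward == buf + N + 1) {               /* end of first buffer  */
            reloadSecondBuffer();                   /* forward already at start of second */
        } else if (forward == buf + 2 * (N + 1)) {  /* end of second buffer */
            reloadFirstBuffer();
            forward = buf;
        } else {
            return EOF_CHAR;                        /* genuine end of input */
        }
        c = *forward++;
    }
    return c;
}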

2.3 Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms "sentence" and "word" are often used as synonyms for "string." A language is any countable set of strings over some fixed alphabet. This definition is very broad. Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.

Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure. The concatenation of languages is the set of all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them. The closure of a language L, denoted L*, is the set of strings obtained by concatenating L zero or more times. Note that L^0, the "concatenation of L zero times," is defined to be {ε}, and inductively, L^i is L^(i-1)L. Finally, the positive closure, denoted L+, is the same as the Kleene closure but without the term L^0. That is, ε will not be in L+ unless it is in L itself.
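For example, if L = {a, b} and M = {0, 1}, then:

L ∪ M = {a, b, 0, 1}
LM    = {a0, a1, b0, b1}
L^2   = LL = {aa, ab, ba, bb}
L*    = {ε, a, b, aa, ab, ba, bb, aaa, ...}, all strings of a's and b's, including ε
L+    = L* - {ε} in this case, since ε is not in L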

Regular Expressions

Regular expressions are built recursively out of smaller regular expressions, using the rules described below. Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions. Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote.

BASIS: There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. Note that by convention, we use italics for symbols and boldface for their corresponding regular expressions.

INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.

2.4 Recognition of Tokens

The language generated by the following grammar is used as an example of how the recognition of tokens is handled. Consider the following grammar fragment:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:

if    → if
then  → then
else  → else
relop → < | <= | = | <> | > | >=
id    → letter ( letter | digit )*
num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers. Unsigned integer and real numbers of Pascal are represented by num. In addition, we assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs, and newlines. The lexical analyzer strips out white space by comparing a string against the regular definition ws below.

delim → blank | tab | newline
ws    → delim+

If a match for ws is found, the lexical analyzer does not return a token to the parser. Rather, it proceeds to find a token following the white space and returns that to the parser. The attribute values for the relational operators are given by the symbolic constants LT, LE, EQ, NE, GT, GE.

REGULAR EXPRESSION   TOKEN   ATTRIBUTE VALUE
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
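The table above translates almost directly into a scanner specification. A hedged sketch in Lex notation (the token codes IF, THEN, ELSE, RELOP, ID, NUM and the symbol-table handling are assumptions; the attribute constants follow the table):

%{
#define IF 256
#define THEN 257
#define ELSE 258
#define RELOP 259
#define ID 260
#define NUM 261
enum { LT, LE, EQ, NE, GT, GE };
int yylval;                     /* attribute value passed to the parser */
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
num     {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}    { /* no token returned; skip white space */ }
if      { return IF; }
then    { return THEN; }
else    { return ELSE; }
{id}    { /* install lexeme in symbol table here */ return ID; }
{num}   { /* install numeric lexeme here */ return NUM; }
"<"     { yylval = LT; return RELOP; }
"<="    { yylval = LE; return RELOP; }
"="     { yylval = EQ; return RELOP; }
"<>"    { yylval = NE; return RELOP; }
">"     { yylval = GT; return RELOP; }
">="    { yylval = GE; return RELOP; }
%%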

Transition Diagrams

A transition diagram is a stylized flowchart. Transition diagrams are used to keep track of information about characters that are seen as the forward pointer scans the input; we do so by moving from position to position in the diagrams as characters are read.

Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows, called edges. Edges leaving state s have labels indicating the input characters that can next appear after the transition diagram has reached state s. The label other refers to any character that is not indicated by any of the other edges leaving s.

One state is labeled as the start state; it is the initial state of the transition diagram, where control resides when we begin to recognize a token. Certain states may have actions that are executed when the flow of control reaches those states. On entering a state we read the next input character. If there is an edge from the current state whose label matches this input character, we go to the state pointed to by the edge; otherwise, we indicate failure. A transition diagram for the relational operators, including >=, is rendered as code in the sketch below.
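Since the original figure is not reproduced here, the same diagram can be expressed directly as code. A hedged C sketch of the relop diagram: states are the circles, the branches are the edges, and retract() undoes the lookahead consumed on an "other" edge (the getchar/ungetc input handling is an assumption).

#include <stdio.h>

enum Relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

static void retract(int c) { ungetc(c, stdin); }   /* give the char back */

enum Relop relop(void) {
    int c = getchar();              /* start state */
    switch (c) {
    case '<':
        c = getchar();
        if (c == '=') return LE;    /* lexeme "<=" */
        if (c == '>') return NE;    /* lexeme "<>" */
        retract(c);  return LT;     /* "other" edge: retract, lexeme "<" */
    case '=':
        return EQ;
    case '>':
        c = getchar();
        if (c == '=') return GE;    /* lexeme ">=" */
        retract(c);  return GT;     /* "other" edge: retract, lexeme ">" */
    default:
        retract(c);  return NOT_RELOP;
    }
}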

2.5 A Language for Specifying Lexical Analyzers

Lex - A Lexical Analyzer Generator

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is translated into a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized, the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

Lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial look-ahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Introduction

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general-purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source given to Lex. The code written by Lex recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called host languages. Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and properties of local implementations.

Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected.

An overview of Lex:

Source -> Lex   -> yylex
Input  -> yylex -> Output

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

%%
[ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (the latter written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate a character class made of blank and tab; the + indicates "one or more"; and the $ indicates "end of line," as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

%%
[ \t]+$   ;
[ \t]+    printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.
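To actually build and run this, the source needs only a little conventional boilerplate. A hedged, complete version follows; the main and yywrap definitions are standard additions, not from the text.

%%
[ \t]+$   ;
[ \t]+    printf(" ");
%%
int yywrap(void) { return 1; }          /* no more input files         */
int main(void) { yylex(); return 0; }   /* run the generated scanner   */

This can typically be compiled with something like `lex file.l && cc lex.yy.c`, though command names vary by system.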

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context-free grammars, but requires a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown below. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

Lex with Yacc:

lexical rules -> Lex  -> yylex
grammar rules -> Yacc -> yyparse
Input -> yylex -> yyparse -> Parsed input

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one-character look-ahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup is more costly than the processing of simpler languages.

Lex Source

The general format of Lex source is:

{definitions}
%%
{rules}
%%
{user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is

%%

(no definitions, no rules), which translates into a program that copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table in which the left column contains regular expressions and the right column contains actions to be executed when the expressions are recognized. Thus an individual rule might appear as

integer   printf("found keyword INT");

to look for the string integer in the input stream and print the message "found keyword INT" whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces (a sketch follows the next example).

As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

colour      printf("color");
mechanise   printf("mechanize");
petrol      printf("gas");

would be a start.
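To illustrate the braces rule mentioned above, here is a hedged sketch of a compound action; the count variable and the second printf argument are illustrative assumptions, not from the text.

%{
int count = 0;                 /* illustrative keyword counter */
%}
%%
integer   {
            count++;           /* more than one statement, so braces are needed */
            printf("found keyword INT (#%d)", count);
          }
%%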

Lex Regular Expressions

A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears, and the expression a57d looks for the string a57d.

Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
"a+b"           literal a+b (C escapes still work)
[ ]             character class

Expression      Matches
abc             abc
abc*            ab, abc, abcc, abccc, ...
abc+            abc, abcc, abccc, ...
a(bc)+          abc, abcbc, abcbcbc, ...
a(bc)?          a, abc
[abc]           a, b, c

[a-z]           any letter, a through z
[a\-z]          a, -, z
[-az]           -, a, z
[A-Za-z0-9]+    one or more alphanumeric characters
[ \t\n]+        whitespace
[^ab]           anything except: a, b
[a^b]           a, ^, b
[a|b]           a, |, b
a|b             a or b

Name               Function
int yylex(void)    call to invoke the lexer; returns token
char *yytext       pointer to the matched string
yyleng             length of the matched string
yylval             value associated with the token
int yywrap(void)   wrap-up; return 1 if done, 0 if not done
FILE *yyout        output file
FILE *yyin         input file
INITIAL            initial start condition
BEGIN              condition switch start condition
ECHO               write matched string

2.6 Finite Automata

1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and ε, the empty string, is a possible label.

(b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.

Both deterministic and nondeterministic finite automata are capable of recognizing the same languages. In fact, these languages are exactly the regular languages.

Nondeterministic Finite Automata

A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols Σ, the input alphabet. We assume that ε, which stands for the empty string, is never a member of Σ.
3. A transition function that gives, for each state and for each symbol in Σ ∪ {ε}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).

We can represent either an NFA or a DFA by a transition graph, where the nodes are states and the labeled edges represent the transition function. There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a. This graph is very much like a transition diagram, except:
a) The same symbol can label edges from one state to several different states, and
b) An edge may be labeled by ε, the empty string, instead of, or in addition to, symbols from the input alphabet.

Example: a transition graph for an NFA recognizing the language of the regular expression (a|b)*abb is reconstructed below, together with its transition table.

Transition Tables

We can also represent an NFA by a transition table, whose rows correspond to states and whose columns correspond to the input symbols and ε. The entry for a given state and input is the value of the transition function applied to those arguments. If the transition function has no information about that state-input pair, we put ∅ in the table for the pair.
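Since the original figure is missing, here is a hedged reconstruction of the standard four-state NFA for (a|b)*abb; the state numbering 0 through 3 is an assumption. Start state 0, accepting state 3, with edges

0 --a--> 0    0 --b--> 0    0 --a--> 1    1 --b--> 2    2 --b--> 3

and transition table

STATE    a        b        ε
0        {0,1}    {0}      ∅
1        ∅        {2}      ∅
2        ∅        {3}      ∅
3        ∅        ∅        ∅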

Acceptance of Input Strings by Automata

An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states, such that the symbols along the path spell out x. Note that ε labels along the path are effectively ignored, since the empty string does not contribute to the string constructed along the path.

Deterministic Finite Automata

A deterministic finite automaton (DFA) is a special case of an NFA where:
1. There are no moves on input ε, and
2. For each state s and input symbol a, there is exactly one edge out of s labeled a.

If we are using a transition table to represent a DFA, then each entry is a single state; we may therefore represent this state without the curly braces that we use to form sets. While the NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings. It is fortunate indeed that every regular expression and every NFA can be converted to a DFA accepting the same language, because it is the DFA that we really implement or simulate when building lexical analyzers. The following algorithm shows how to apply a DFA to a string.

Algorithm: Simulating a DFA.
INPUT: An input string x terminated by an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move.
OUTPUT: Answer "yes" if D accepts x; "no" otherwise.
METHOD: Starting in state s0, repeatedly apply move to the current state and the next input character until the input is exhausted; answer "yes" exactly when the final state is in F. The function move(s, c) gives the state to which there is an edge from state s on input c. The function nextChar() returns the next character of the input string x. A C sketch follows.
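A hedged C sketch of this algorithm, instantiated with a table-driven DFA for (a|b)*abb; the concrete transition table is an illustrative assumption matching the standard construction.

#include <stdbool.h>
#include <stdio.h>

enum { NSTATES = 4, START = 0 };

/* move[s][c]: next state from s on input c ('a' -> index 0, 'b' -> index 1) */
static const int move[NSTATES][2] = {
    {1, 0},   /* state 0: a -> 1, b -> 0 */
    {1, 2},   /* state 1: a -> 1, b -> 2 */
    {1, 3},   /* state 2: a -> 1, b -> 3 */
    {1, 0},   /* state 3: a -> 1, b -> 0 */
};
static const bool accepting[NSTATES] = { false, false, false, true };

bool simulate(const char *x) {
    int s = START;
    for (; *x != '\0'; x++) {            /* nextChar() over the string   */
        if (*x != 'a' && *x != 'b') return false;
        s = move[s][*x - 'a'];           /* one transition per character */
    }
    return accepting[s];                 /* "yes" iff the final state is in F */
}

int main(void) {
    printf("%d %d\n", simulate("abb"), simulate("aabbb"));  /* prints: 1 0 */
    return 0;
}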

From Regular Expressions to Automata

The regular expression is the notation of choice for describing lexical analyzers and other pattern-processing software. However, implementation of that software requires the simulation of a DFA, as in the algorithm above, or perhaps the simulation of an NFA. Because an NFA often has a choice of move on an input symbol, or on ε, or even a choice between making a transition on ε and on a real input symbol, its simulation is less straightforward than for a DFA. Thus it is often important to convert an NFA to a DFA that accepts the same language.

In this section we shall first show how to convert NFA's to DFA's. This technique, known as "the subset construction," also gives a useful algorithm for simulating NFA's directly, in situations (other than lexical analysis) where the NFA-to-DFA conversion takes more time than the direct simulation. Next, we show how to convert regular expressions to NFA's, from which a DFA can be constructed if desired. We conclude with a discussion of the time-space tradeoffs inherent in the various methods for implementing regular expressions, and see how to choose the appropriate method for an application.

Conversion of an NFA to a DFA

The general idea behind the subset construction is that each state of the constructed DFA corresponds to a set of NFA states. After reading input a1a2...an, the DFA is in that state which corresponds to

the set of states that the NFA can reach, from its start state, following paths labeled a1a2...an.

It is possible that the number of DFA states is exponential in the number of NFA states, which could lead to difficulties when we try to implement this DFA. However, part of the power of the automaton-based approach to lexical analysis is that for real languages, the NFA and the DFA have approximately the same number of states, and the exponential behavior is not seen.

Algorithm: The subset construction of a DFA from an NFA.
INPUT: An NFA N.
OUTPUT: A DFA D accepting the same language as N.
METHOD: Our algorithm constructs a transition table Dtran for D. Each state of D is a set of NFA states, and we construct Dtran so that D will simulate "in parallel" all possible moves N can make on a given input string. Our first problem is to deal with the ε-transitions of N properly. The basic computations on the states of N that are needed in the algorithm are: ε-closure(s), the set of NFA states reachable from state s on ε-transitions alone; ε-closure(T), the union of ε-closure(s) over all states s in T; and move(T, a), the set of states to which there is a transition on input symbol a from some state in T. Note that s is a single state of N, while T is a set of states of N. A sketch of these operations appears below.
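A hedged C sketch of ε-closure(T) and move(T, a), representing a set of NFA states as a bitmask; this assumes at most 32 NFA states, and the transition data encodes the (a|b)*abb NFA reconstructed earlier, which happens to have no ε-transitions.

#include <stdio.h>
#include <stdint.h>

#define NSTATES 4
typedef uint32_t StateSet;           /* bit s set <=> NFA state s is in the set */

/* trans[s][c]: states reachable from s on input 'a' (c=0) or 'b' (c=1) */
static const StateSet trans[NSTATES][2] = {
    { 0x3, 0x1 },                    /* state 0: a -> {0,1}, b -> {0} */
    { 0x0, 0x4 },                    /* state 1: b -> {2}             */
    { 0x0, 0x8 },                    /* state 2: b -> {3}             */
    { 0x0, 0x0 },                    /* state 3: no transitions       */
};
static const StateSet eps[NSTATES] = { 0, 0, 0, 0 };   /* ε-edges (none here) */

/* epsClosure(T): add states reachable on ε-edges until a fixed point. */
StateSet epsClosure(StateSet T) {
    for (;;) {
        StateSet next = T;
        for (int s = 0; s < NSTATES; s++)
            if (T & (1u << s)) next |= eps[s];
        if (next == T) return T;
        T = next;
    }
}

/* moveSet(T, c): union of the transitions on c from every state in T. */
StateSet moveSet(StateSet T, int c) {
    StateSet result = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) result |= trans[s][c];
    return result;
}

int main(void) {
    StateSet A = epsClosure(1u << 0);                /* start D-state {0}       */
    printf("0x%x\n", epsClosure(moveSet(A, 0)));     /* Dtran[A, a]: prints 0x3 */
    return 0;
}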

Algorithm : The McNaughton-Yamada-Thompson algorithm to convert a regular expression to an NFA. INPUT: A regular expression r over alphabet C. OUTPUT: An NFA N accepting L(r). METHOD: Begin by parsing r into its constituent subexpressions. The rules for constructing an NFA consist of basis rules for handling subexpressions with no operators, and inductive rules for constructing larger NFA's from the NFA's for the immediate subexpressions of a given expression. 2.7 Design of a lexical analyzer A lexical analyzer generator creates a lexical analyser using a set of specifications usually in the format p 1 {action 1 } p 2 {action 2 }............ p n {action n } where p i is a regular expression and each action action i is a program fragment that is to be executed whenever a lexeme matched by p i is found in the input. If more than one pattern matches, then longest lexeme matched is chosen. If there are two or more patterns that match the longest lexeme, the first listed matching pattern is chosen. This is usually implemented using a finite automaton. There is an input buffer with two pointers to it, a lexeme-beginning and a forward pointer. The lexical analyser generator constructs a transition table for a finite automaton from the regular expression patterns in the lexical analyser generator specification. The lexical analyser itself consists of a finite automaton simulator that uses this transition table to look for the regular expression patterns in the input buffer. This can be implemented using an NFA or a DFA. The transition table for an NFA is considerably smaller than that for a DFA, but the DFA recognises patterns faster than the NFA. Using NFA The transition table for the NFA N is constructed for the composite pattern p 1 p 2... p n, The NFA recognises the longest prefix of the input that is matched by a pattern. In the final NFA, there is Principles of Compiler Design Unit2 18

This is usually implemented using a finite automaton. There is an input buffer with two pointers into it, a lexeme-beginning pointer and a forward pointer. The lexical analyzer generator constructs a transition table for a finite automaton from the regular-expression patterns in the specification. The lexical analyzer itself consists of a finite-automaton simulator that uses this transition table to look for the regular-expression patterns in the input buffer. The simulator can be based on an NFA or a DFA. The transition table for an NFA is considerably smaller than that for a DFA, but the DFA recognizes patterns faster than the NFA.

Using an NFA

The transition table for an NFA N is constructed for the composite pattern p1|p2|...|pn. The NFA recognizes the longest prefix of the input that is matched by a pattern. In the final NFA, there is an accepting state for each pattern pi. The sequence of sets of states the NFA can be in after seeing each input character is constructed. The NFA is simulated until it reaches termination, or until it reaches a set of states from which there is no transition defined for the current input symbol. The specification for the lexical analyzer generator is assumed to be such that a valid source program cannot entirely fill the input buffer without the NFA reaching termination.

To find the correct match, two things are done. First, whenever an accepting state is added to the current set of states, the current input position and the pattern pi corresponding to that accepting state are recorded. If the current set of states already contains an accepting state, only the pattern that appears first in the specification is recorded. Second, transitions are recorded until termination is reached. Upon termination, the forward pointer is retracted to the position at which the last match occurred. The pattern making that match identifies the token found, and the lexeme matched is the string between the lexeme-beginning and forward pointers. If no pattern matches, the lexical analyzer should transfer control to some default recovery routine.

Using a DFA

Here a DFA is used for the pattern matching. This method is a modified version of the method using an NFA: the NFA is converted to a DFA by the subset construction. A given subset of nondeterministic states may then contain several accepting states; the accepting state corresponding to the pattern listed first in the generator specification takes precedence.

Summary

Tokens. The lexical analyzer scans the source program and produces as output a sequence of tokens, which are normally passed, one at a time, to the parser. Some tokens may consist only of a token name, while others may also have an associated lexical value that gives information about the particular instance of the token that has been found on the input.

Lexemes. Each time the lexical analyzer returns a token to the parser, it has an associated lexeme - the sequence of input characters that the token represents.

Buffering. Because it is often necessary to scan ahead on the input in order to see where the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input. Using a pair of buffers cyclically, and ending each buffer's contents with a sentinel that warns of its end, are two techniques that accelerate the process of scanning the input.

Patterns. Each token has a pattern that describes which sequences of characters can form the lexemes corresponding to that token. The set of words, or strings of characters, that match a given pattern is called a language.

Regular Expressions. These expressions are commonly used to describe patterns. Regular expressions are built from single characters, using union, concatenation, and the Kleene closure, or any-number-of, operator.

Transition Diagrams. The behavior of a lexical analyzer can often be described by a transition diagram. These diagrams have states, each of which represents something about the history of the characters seen during the current search for a lexeme that matches one of the possible patterns. There are arrows, or transitions, from one state to another, each of which indicates the possible next input characters that cause the lexical analyzer to make that change of state.

Finite Automata. These are a formalization of transition diagrams that include a designation of a start state and one or more accepting states, as well as the set of states, input characters, and transitions among states. Accepting states indicate that the lexeme for some token has been found. Unlike transition diagrams, finite automata can make transitions on empty input as well as on input characters.

Deterministic Finite Automata. A DFA is a special kind of finite automaton that has exactly one transition out of each state for each input symbol. Also, transitions on empty input are disallowed. The DFA is easily simulated and makes a good implementation of a lexical analyzer, similar to a transition diagram.

Nondeterministic Finite Automata. Automata that are not DFA's are called nondeterministic. NFA's are often easier to design than DFA's. Another possible architecture for a lexical analyzer is to tabulate all the states that the NFA's for each of the possible patterns can be in, as we scan the input characters.

Key Terms

>> preprocessors >> linkers >> loaders >> Tokens >> Buffering >> Patterns >> Transition Diagrams >> Finite automata >> NFA >> DFA >> Regular Expressions >> assemblers

Key Term Quiz

1. The behavior of a lexical analyzer can often be described by ----------------------
2. ---------------------- include a designation of a start state and one or more accepting states, as well as the set of states, input characters, and transitions among states.

3. ----------------- produce input to compilers.
4. --------------- is a program that translates a symbolic version of an instruction into the binary version.
5. A utility program that sets up an executable program in main memory ready for execution is called -----------
6. The utility program that combines several separately compiled modules into one, resolving internal references between them, is called --------------
7. A ---------- is a pair consisting of a token name and an optional attribute value.
8. A ------------- is a description of the form that the lexemes of a token may take.
9. A ------------- is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
10. The ------------------ technique has been developed to reduce the amount of overhead required to process a single input character.

Multiple Choice Questions

1. The utility program that combines several separately compiled modules into one, resolving internal references between them, is called
a. preprocessors  b. assemblers  c. loaders  d. linkers
2. A --------- is a pair consisting of a token name and an optional attribute value.
a. pattern  b. token  c. lexeme  d. input buffer
3. A -------------- is a description of the form that the lexemes of a token may take.
a. pattern  b. token  c. lexeme  d. input buffer
4. A ------------ is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
a. pattern  b. token  c. lexeme  d. input buffer

Review Questions

Two-mark Questions
1. Write short notes on buffer pairs.
2. What is the role of lexical analysis?
3. How is an NFA constructed from a regular expression?
4. Define tokens, lexemes, and patterns.
5. What are the attributes of tokens?
6. Define transition diagram.
7. What is meant by finite automata?
8. Compare NFA and DFA.
9. Define input buffering.
10. Write an algorithm to convert an NFA to a DFA.
11. Define ε-closure.
12. What are the drawbacks of using buffer pairs?

Big Questions
1. Construct the NFA for (a|b)*a(a|b).
2. Explain the input buffering technique.
3. Give the minimized DFA for the expression (a|b)*abb.
4. Draw the transition diagram for unsigned numbers.
5. Explain the specification of tokens.
6. Construct a DFA directly from the regular expression (a|b)*abb, without constructing an NFA.

Lesson Labs

Exercise 1: Describe the languages denoted by the following regular expressions:

Exercise 2: Write regular definitions for the following languages.
(i) All strings of letters that contain the five vowels in order.
(ii) All strings of letters in which the letters are in ascending lexicographic order.
(iii) All strings of digits with no repeated digit.
(iv) All strings of digits with at most one repeated digit.
(v) All strings of 0's and 1's with an even number of 0's and an odd number of 1's.
(vi) All strings of 0's and 1's that do not contain the substring 011.
(vii) All strings of 0's and 1's that do not contain the subsequence 011.

Exercise 3: Construct NFAs for the following.

----- END OF SECOND UNIT ----------

8. Compare NFA and DFA. 9. Define input buffering. 10. Write an algorithm to convert NFA to DFA. 11. Define ε-closure? 12. What are the drawbacks of using buffer pairs? Big Questions 1. Construct the NFA from the (a/b)*a(a/b). 2. Explain about input buffering technique. 3. Give the minimized DFA for the following expression (a/b)*abb. 4. Draw the transition diagram for unsigned numbers. 5. Explain specification of tokens. 6. Construct DFA directly from regular expression (a b)*abb without constructing NFA. Lesson Labs Exercise 1 : Describe the languages denoted by the following regular expressions: Exercise 2: Write regular definitions for the following languages. (i) All strings of letters that contain the five vowels in order. (ii) All strings of letters in which the letters are in ascending lexicographic order. (iii) All strings of digits with no repeated digit. (iv) All strings of digits with at most on repeated digit. (v) All strings of 0 s and 1 s with an even number of 0 s and 1 s and an odd number of 1 s. (vi) All strings of 0 s and 1 s that do not contain the substring 011. (vii) All strings of 0 s and 1 s that do not contain the subsequences 011. Exercise 3: Construct NFA for the following. ----- END OF SECOND UNIT ---------- Principles of Compiler Design Unit2 22