We use $L^i$ to stand for LL...L (i times). It is logical to define $L^0$ to be {ε}. The union of languages L and M is given by


Madlyn Patrick
6 years ago

We use the term language to mean any set of strings formed from some specific alphabet. The notion of concatenation can also be applied to languages. If L and M are languages, then L.M is the language consisting of all strings xy that can be formed by selecting a string x from L and a string y from M and concatenating them in that order. That is,

$LM = \{xy \mid x \text{ is in } L \text{ and } y \text{ is in } M\}$

We call LM the concatenation of L and M.

Example: Let L be {0, 01, 110} and let M be {10, 110}. Then LM = {010, 0110, 01110, 11010, 110110}. The string 0110 arises twice, as 0.110 and as 01.10, but a set lists it only once.

The dot "." is the concatenation operator on strings:

w1 = fire, w2 = truck
w1.w2 = firetruck
w2.w1 = truckfire
w2.w2 = trucktruck

We often drop the dot: w1w2 = firetruck. For any string w, wε = w.

We can concatenate languages as well as strings:

$L_1 L_2 = \{wv : w \in L_1 \text{ and } v \in L_2\}$

{a, ab}{ε, b} = {a, ab, abb}
{a, ab}{a, ab} = {aa, aab, aba, abab}
{a, aa}{a, aa} = {aa, aaa, aaaa}

We use $L^i$ to stand for LL...L (i times). It is logical to define $L^0$ to be {ε}. The union of languages L and M is given by

$L \cup M = \{x \mid x \text{ is in } L \text{ or } x \text{ is in } M\}$

The empty set, ∅, is the identity under union, since

$\emptyset \cup L = L \cup \emptyset = L$

and, for concatenation,

$\emptyset L = L \emptyset = \emptyset$
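The set definitions above translate almost directly into code. The following is a minimal sketch, assuming finite languages represented as Python sets of strings, with the empty string "" standing for ε; the function names `concat` and `power` are illustrative and do not come from the notes.

```python
def concat(L, M):
    """Return LM = {xy | x is in L and y is in M}."""
    return {x + y for x in L for y in M}


def power(L, i):
    """Return L^i, where L^0 is the set holding only the empty string."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


L = {"0", "01", "110"}
M = {"10", "110"}

print(sorted(concat(L, M)))                      # the strings of LM
print(concat({"a", "ab"}, {"", "b"}))            # {a, ab}{ε, b}
print(L | set() == L)                            # the empty set is the identity under union
print(concat(L, set()) == set())                 # concatenating with the empty set gives the empty set
```

Because sets discard duplicates, the two ways of forming 0110 contribute a single element to the result.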

There is another operation on languages which plays an important role in specifying tokens. This is the Kleene closure operator. We use L* to denote the concatenation of language L with itself any number of times:

$L^* = \bigcup_{i=0}^{\infty} L^i$

Example: Let D be the language consisting of the strings 0, 1, ..., 9, that is, each string is a single decimal digit. Then D* is all strings of digits, including the empty string. As another example, if L = {aa}, then L* is all strings of an even number of a's, since $L^0$ = {ε}, $L^1$ = {aa}, $L^2$ = {aaaa}, and so on. If we wished to exclude ε, we could write L.(L*) to denote that language. That is:

$L.(L^*) = L.\bigcup_{i=0}^{\infty} L^i = \bigcup_{i=0}^{\infty} L^{i+1} = \bigcup_{i=1}^{\infty} L^i$

We shall often use L+ for L.(L*). The unary postfix operator + is called positive closure, and denotes "one or more instances of".

A Simple Approach to the Design of Lexical Analyzers

There are two primary methods for implementing a scanner. The first is a program that is hard-coded to perform the scanning tasks. The second uses regular expressions and finite automata theory to model the scanning process.

One way to begin the design of any program is to describe the behavior of the program by a flowchart. This approach is particularly useful when the program is a lexical analyzer, because the action taken is highly dependent on what characters have been seen recently. Remembering previous characters by the position in a flowchart is a valuable tool, so much so that a specialized kind of flowchart for lexical analyzers, called a transition diagram, has evolved. In a transition diagram, the boxes of the flowchart are drawn as circles and called states. The states are connected by arrows, called edges. The labels on the various edges leaving a state indicate the input characters that can appear after that state.

identifier = letter {letter | digit}*
digit = [0-9]
letter = [A-Z a-z]
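The star in the identifier definition is the Kleene closure introduced at the start of this section. Since L* is usually infinite, it can only be listed up to a length bound. The sketch below does that for L* and L+; the function name `closure_up_to` and the bound `max_len` are assumptions made for illustration, not part of the notes.

```python
def concat(L, M):
    """Return LM = {xy | x is in L and y is in M}."""
    return {x + y for x in L for y in M}


def closure_up_to(L, max_len, positive=False):
    """Return the strings of L* (or of L+ when positive=True) of length <= max_len."""
    result = set() if positive else {""}   # L^0 = {""} belongs to L* only
    layer = {""}                            # holds the short strings of L^i, starting at i = 0
    while True:
        layer = {w for w in concat(layer, L) if len(w) <= max_len}
        if layer <= result:                 # L^i added nothing new, so no later power will
            break
        result |= layer
    return result


D = set("0123456789")
print(sorted(closure_up_to({"aa"}, 6)))                  # even-length strings of a's
print(sorted(closure_up_to({"aa"}, 6, positive=True)))   # same, without the empty string
print(len(closure_up_to(D, 2)))                          # digit strings of length 0, 1 or 2
```

The only difference between the two closures is whether the union starts at i = 0 or at i = 1, which is exactly what the `positive` flag controls.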

[Fig. 6: Transition diagram for identifier. From the start state 0, an edge labeled "letter" leads to state 1. State 1 has a loop labeled "letter or digit" and an edge labeled "delimiter" leading to state 2, which is marked with *.]

Fig. 6 shows a transition diagram for an identifier, defined to be a letter followed by any number of letters or digits. The starting state of the transition diagram is state 0, the edge from which indicates that the first input character must be a letter. If this is the case, we enter state 1 and look at the next input character. If this is a letter or a digit, we continue this way, reading letters and digits and making transitions from state 1 to itself, until the next input character is a delimiter for an identifier, which we have assumed is any character that is not a letter or a digit. On reading the delimiter, we enter state 2.

To turn a collection of transition diagrams into a program, we construct a segment of code for each state. The first step to be done in the code for any state is to obtain the next character from the input buffer. For this purpose we use a function GETCHAR, which returns the next character, advancing the lookahead pointer at each call. The next step is to determine which edge, if any, out of the state is labeled by a character or class of characters that includes the character just read. If such an edge is found, control is transferred to the state pointed to by that edge. If no such edge is found, and the state is not one which indicates that a token has been found (indicated by a double circle), we have failed to find this token. The lookahead pointer must be retracted to where the beginning pointer is, and another token must be searched for, using another transition diagram. If all transition diagrams have been tried without success, a lexical error has been detected, and an error correction routine must be called.

Consider the transition diagram in Fig. 6. The code for state 0 might be:

    State 0: C := GETCHAR();
             if LETTER(C) then goto state 1
             else FAIL()

Here, LETTER is a procedure which returns true if and only if C is a letter. FAIL() is a routine which retracts the lookahead pointer and starts up the next transition diagram, if there is one, or calls the error routine. The code for state 1 is:

    State 1: C := GETCHAR();
             if LETTER(C) or DIGIT(C) then goto state 1
             else if DELIMITER(C) then goto state 2
             else FAIL()
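The state-by-state pseudocode can be sketched as one loop over a state variable. The version below is only an illustration under stated assumptions: the input is a Python string rather than an input buffer, end of input is treated as a delimiter, any character that is not a letter or digit counts as a delimiter, and returning None stands in for the call to FAIL(). The name `scan_identifier` is hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)   # letter = [A-Z a-z]
DIGITS = set(string.digits)           # digit  = [0-9]


def scan_identifier(text, begin=0):
    """Run the Fig. 6 diagram on text starting at index `begin`.

    Returns the identifier found, or None where the pseudocode would call FAIL().
    """
    lookahead = begin

    def getchar():
        # Return the next character and advance the lookahead pointer.
        # Assumption: "" stands for end of input and acts as a delimiter.
        nonlocal lookahead
        c = text[lookahead] if lookahead < len(text) else ""
        lookahead += 1
        return c

    state = 0
    while state != 2:
        c = getchar()
        if state == 0:
            if c in LETTERS:
                state = 1
            else:
                return None          # FAIL(): the first character is not a letter
        elif c in LETTERS or c in DIGITS:
            state = 1                # stay in state 1
        else:
            state = 2                # any other character is a delimiter
    # The last character read was the delimiter; it is not part of the identifier.
    return text[begin:lookahead - 1]


print(scan_identifier("count1 = 0"))
print(scan_identifier("9lives"))
```

Keeping the current state in a variable plays the role of the goto statements: each pass through the loop is one edge of the diagram.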

DIGIT is a procedure which returns true if and only if C is one of the digits 0, 1, ..., 9. DELIMITER is a procedure which returns true whenever C is a character that could follow an identifier. If we define a delimiter to be any character that is not a letter or digit, then the clause "if DELIMITER(C) then" need not be present in state 1. To detect errors more effectively we might define a delimiter precisely (e.g., blank, arithmetic or logical operator, left or right parenthesis, equal sign, colon, semicolon, or comma), depending on the language being compiled.

State 2 indicates that an identifier has been found. Since the delimiter is not part of the identifier, we must retract the lookahead pointer one character, for which we use a procedure RETRACT. We use '*' to indicate states on which input retraction must take place. We must also install the newly found identifier in the symbol table if it is not already there, using the procedure INSTALL. In state 2 we return a pair consisting of the integer code for an identifier, which we denote by id, and a value that is a pointer to the symbol table returned by INSTALL. The code for state 2 is:

    State 2: RETRACT();
             return (id, INSTALL())

If blanks must be skipped in the language at hand, we should include in the code for state 2 a step that moves the beginning pointer to the next non-blank.

Fig. 7 shows a list of tokens that we want to recognize, using a token recognizer built from the transition diagrams of Fig. 8.

    Token        Code   Value
    begin        1      -
    end          2      -
    if           3      -
    then         4      -
    else         5      -
    identifier   6      Pointer to symbol table
    constant     7      Pointer to symbol table
    <            8      1
    <=           8      2
    =            8      3
    <>           8      4
    >            8      5
    >=           8      6

Fig. 7: Token Recognizer
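To make the (code, value) pairs of Fig. 7 concrete, here is a small hand-coded scanner sketch. It is not the diagram-by-diagram program described in the text: it scans a whole run of letters and digits and then looks the run up in a keyword table, a common shortcut swapped in here for brevity. Keyword codes 1 to 5 follow the return actions of Fig. 8. The symbol table is assumed to be a plain list whose indices act as pointers, keywords are assumed to be written in lower case, and the names `tokenize` and `install` are illustrative.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
KEYWORDS = {"begin": 1, "end": 2, "if": 3, "then": 4, "else": 5}
RELOPS = {"<": 1, "<=": 2, "=": 3, "<>": 4, ">": 5, ">=": 6}
ID, CONST, RELOP = 6, 7, 8


def tokenize(text):
    """Return (tokens, symbol_table); each token is a (code, value) pair."""
    symbol_table = []

    def install(lexeme):
        # Add the lexeme if it is not already there, then return its index.
        if lexeme not in symbol_table:
            symbol_table.append(lexeme)
        return symbol_table.index(lexeme)

    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        start = i
        if c in " \n":                              # skip blanks and newlines
            i += 1
        elif c in LETTERS:                          # keyword or identifier
            while i < len(text) and (text[i] in LETTERS or text[i] in DIGITS):
                i += 1
            lexeme = text[start:i]
            if lexeme in KEYWORDS:
                tokens.append((KEYWORDS[lexeme], None))
            else:
                tokens.append((ID, install(lexeme)))
        elif c in DIGITS:                           # constant
            while i < len(text) and text[i] in DIGITS:
                i += 1
            tokens.append((CONST, install(text[start:i])))
        elif len(text[i:i + 2]) == 2 and text[i:i + 2] in RELOPS:
            tokens.append((RELOP, RELOPS[text[i:i + 2]]))   # two-character relop
            i += 2
        elif c in RELOPS:
            tokens.append((RELOP, RELOPS[c]))               # one-character relop
            i += 1
        else:
            raise ValueError(f"lexical error at position {i}: {c!r}")
    return tokens, symbol_table


tokens, table = tokenize("if ifa <= 10 then x <> ifa")
print(tokens)
print(table)
```

Trying the two-character relops before the one-character ones is what keeps `<=` from being split into `<` followed by `=`.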

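[Fig. 8, first part: transition diagrams for keywords, identifiers and constants. The drawings did not survive extraction; the recoverable edge labels and actions are listed here. A * marks states on which the input is retracted.]

Keywords:
    B E G I N, then blank or newline: * return (1, )
    E N D, then blank or newline: * return (2, )
    E L S E, then blank or newline: * return (5, )
    I F, then blank or newline: * return (3, )
    T H E N, then blank or newline: * return (4, )

Identifier:
    letter, then any number of letters or digits, then a character that is not a letter or digit: * return (6, INSTALL())

Constant:
    digit, then any number of digits, then a character that is not a digit: * return (7, INSTALL())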

Relops (Fig. 8, continued):
    <, then a character that is not = or >: * return (8, 1)
    <, then =: return (8, 2)
    <, then >: return (8, 4)
    =: return (8, 3)
    >, then a character that is not =: * return (8, 5)
    >, then =: return (8, 6)

Fig. 8: Transition Diagram

A more efficient program can be constructed from a single transition diagram than from a collection of diagrams, since there is no need to backtrack and rescan using a second transition diagram. In Fig. 8, we have combined all keywords into one transition diagram. However, if we attempt to combine the diagram for identifiers with that for keywords, difficulties arise. For example, on seeing the three letters BEG, we could not tell whether to be in state 3 or state 24.

In Fig. 8, each keyword is treated as a separate token, whereas all relops are combined into one token class, with the associated token value distinguishing one relop from another.

Let us now consider an example of the action of the lexical analyzer constructed from the transition diagram of Fig. 8. On seeing IFA followed by a blank, the lexical analyzer would traverse states 0, 15, and 16, then fail and retract the input to I. It would then start up the second transition diagram at state 23, traverse state 24 three times, go to state 25 on the blank, retract the input one position, and install IFA in the symbol table.

Definition of Regular Expression

After the definition of strings and languages, we are ready to describe regular expressions, the notation we shall use to define the class of languages known as regular sets. Recall that a token is either a single string (such as a punctuation symbol) or one of a collection of strings of a certain type (such as an identifier). If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.

In regular expression notation we could write the definition for identifier as:

identifier = letter (letter | digit)*

The vertical bar means "or", that is, union; the parentheses are used to group subexpressions; and the star is the closure operator meaning "zero or more instances".

What we call the regular expressions over alphabet Σ are exactly those expressions that can be constructed from the following rules. Each regular expression denotes a language, and we give the rules for construction of the denoted languages along with the regular-expression construction rules.

1- ε is a regular expression denoting {ε}, that is, the language consisting only of the empty string.
2- For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3- If R and S are regular expressions denoting languages $L_R$ and $L_S$, respectively, then:
   i) (R)|(S) is a regular expression denoting $L_R \cup L_S$
   ii) (R).(S) is a regular expression denoting $L_R.L_S$
   iii) (R)* is a regular expression denoting $L_R^*$

We have shown regular expressions formed with parentheses whenever possible. In fact, we eliminate them when we can, using the precedence rules that * has highest precedence, then comes ., and | has lowest precedence.
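The three construction rules map one-to-one onto a recursive function that computes the language a regular expression denotes. The sketch below assumes a regular expression is already given as a nested tuple, so the precedence rules are not needed, and it lists only strings up to a length bound because closures are infinite. The tuple tags ("eps", "sym", "or", "cat", "star"), the function name `denote`, and the two-letter, two-digit alphabet in the example are all illustrative assumptions.

```python
def denote(regex, max_len):
    """Return the strings of length <= max_len in the language denoted by regex.

    regex is one of: ("eps",), ("sym", a), ("or", R, S), ("cat", R, S), ("star", R).
    """
    kind = regex[0]
    if kind == "eps":                                  # rule 1: denotes {""}
        return {""}
    if kind == "sym":                                  # rule 2: denotes {a}
        return {regex[1]} if len(regex[1]) <= max_len else set()
    if kind == "or":                                   # rule 3(i): union
        return denote(regex[1], max_len) | denote(regex[2], max_len)
    if kind == "cat":                                  # rule 3(ii): concatenation
        left = denote(regex[1], max_len)
        right = denote(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                                 # rule 3(iii): closure
        base = denote(regex[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:                                # extend only the newly found strings
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression node: {kind!r}")


# identifier = letter (letter | digit)*, with a tiny stand-in alphabet
letter = ("or", ("sym", "x"), ("sym", "y"))
digit = ("or", ("sym", "0"), ("sym", "1"))
identifier = ("cat", letter, ("star", ("or", letter, digit)))

print(sorted(denote(identifier, 2)))
```

Each branch of the function corresponds to one rule, which makes it easy to check a hand-derived language against the definition.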

Let us assume that our alphabet Σ is {a, b}. The regular expression a denotes {a}, which is different from just the string a.

1- The regular expression a* denotes the closure of the language {a}, that is,

$a^* = \bigcup_{i=0}^{\infty} \{a^i\}$

the set of all strings of zero or more a's. The regular expression aa*, which by our precedence rules is parsed a(a)*, denotes the strings of one or more a's. We may use a+ for aa*.

2- What does the regular expression (a|b)* denote? We see that a|b denotes {a, b}, the language with the two strings a and b. Thus (a|b)* denotes

$\bigcup_{i=0}^{\infty} \{a, b\}^i$

which is just the set of all strings of a's and b's, including the empty string. The regular expression (a*b*)* denotes the same set.

3- The expression a|ba* is grouped a|(b(a)*), and denotes the set of strings consisting of either a single a, or a b followed by zero or more a's.

4- The expression aa|ab|ba|bb denotes all strings of length two, so (aa|ab|ba|bb)* denotes all strings of even length. Note that ε is a string of length zero.

5- ε|a|b denotes the strings of length zero or one.

Example: The tokens discussed in Fig. 7 can be described by regular expressions as follows:

Keyword = BEGIN | END | IF | THEN | ELSE
Identifier = letter (letter | digit)*
Constant = digit+
Relops = < | <= | = | <> | > | >=

where letter stands for A|B|...|Z, and digit stands for 0|1|...|9. Constant uses the positive closure because a constant must contain at least one digit.

If two regular expressions R and S denote the same language, we write R = S, and say that R and S are equivalent. For example, we previously observed that (a|b)* = (a*b*)*. For any regular expressions R, S and T, the following axioms hold:

1- R|S = S|R (| is commutative)
2- R|(S|T) = (R|S)|T (| is associative)
3- R(ST) = (RS)T (. is associative)
4- R(S|T) = RS|RT and (S|T)R = SR|TR (. distributes over |)
5- εR = Rε = R (ε is the identity for concatenation)

Finite Automata

A recognizer for a language L is a program that takes as input a string x and answers "yes" if x is a sentence of L and "no" otherwise. Clearly, the part of a lexical analyzer that identifies the presence of a token on the input is a recognizer for the language defining that token.

Suppose we have specified a language by a regular expression R, and we are given some string x. We want to know whether x is in the language L denoted by R. One way to attempt this test is to check that x can be decomposed into a sequence of substrings denoted by the primitive subexpressions in R. Suppose R is (a|b)*abb, the set of all strings ending in abb, and x is the string aabb. We see that R = R1R2, where R1 = (a|b)* and R2 = abb. We can verify that a is an element of the language denoted by R1 and that abb similarly matches R2. In this way, we show that aabb is in the language denoted by R.

Nondeterministic Finite Automata (NFA)

A better way to convert a regular expression to a recognizer is to construct a generalized transition diagram from the expression. This diagram is called a nondeterministic finite automaton. A nondeterministic finite automaton recognizing the language (a|b)*abb is shown in Fig. 9.

[Fig. 9: Nondeterministic Finite Automata]

The NFA is a labeled directed graph. The nodes are called states, and the labeled edges are called transitions. The NFA looks almost like a transition diagram, but edges can be labeled by ε as well as by characters, and the same character can label two or more transitions out of one state. One state (0 in Fig. 9) is distinguished as the start state, and one or more states may be distinguished as accepting states (or final states). State 3 in Fig. 9 is the final state.

The transitions of an NFA can be conveniently represented in tabular form by means of a transition table. The transition table for the NFA of Fig. 9 is shown in Fig. 10. In the transition table there is a row for each state and a column for each input symbol. The entry for row i and symbol a is the set of possible next states for state i on input a.

    State   Input symbol
            a        b
    0       {0,1}    {0}
    1       -        {2}
    2       -        {3}

Fig. 10: Transition Table

The NFA accepts an input string x if and only if there is a path from the start state to some accepting state, such that the labels along that path spell out x. If the input string is aabb, then we can show this sequence of moves:

    State   Remaining input
    0       aabb
    0       abb
    1       bb
    2       b
    3       ε

In Fig. 11 below we can see an NFA to recognize aa*|bb*. String aaa is accepted by going through states 0, 1, 2, 2, and 2. The labels of these edges are ε, a, a and a, whose concatenation is aaa.

[Fig. 11: NFA accepting aa*|bb*. Recoverable edges: from the start state 0 on ε to state 1, from state 1 on a to state 2, and from state 2 on a to itself.]
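Acceptance by an NFA can be checked mechanically without guessing a path, by tracking the set of all states the automaton could be in after each input symbol. The sketch below does this for the transition table of Fig. 10. It also encodes an NFA for aa*|bb* in the spirit of Fig. 11, but only states 0, 1 and 2 are taken from the text; the b branch (states 3 and 4) and its accepting state are assumptions filled in for illustration. The function name `accepts` and the dictionary layout are likewise hypothetical.

```python
def accepts(delta, start, finals, x, eps=None):
    """Return True if the NFA accepts the string x.

    delta maps (state, symbol) to a set of next states;
    eps maps a state to the set of states reachable by one ε edge.
    """
    eps = eps or {}

    def closure(states):
        # Add every state reachable through ε edges alone.
        seen = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            for t in eps.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for symbol in x:
        moved = set()
        for s in current:
            moved |= delta.get((s, symbol), set())
        current = closure(moved)
    return bool(current & finals)


# Transition table of Fig. 10 for (a|b)*abb; state 3 is the final state.
FIG10 = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
print(accepts(FIG10, 0, {3}, "aabb"))
print(accepts(FIG10, 0, {3}, "abab"))

# aa*|bb*: states 3 and 4 (the b branch) are assumed, not taken from the figure.
FIG11 = {(1, "a"): {2}, (2, "a"): {2}, (3, "b"): {4}, (4, "b"): {4}}
FIG11_EPS = {0: {1, 3}}
print(accepts(FIG11, 0, {2, 4}, "aaa", eps=FIG11_EPS))
```

Working with sets of states means the same character labeling two edges out of state 0 causes no difficulty: both successors are simply kept.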


1. Lexical Analysis Phase The purpose of the lexical analyzer is to read the source program, one character at a time, and to translate it into a sequence of primitive units called tokens. Keywords, identifiers,