Definition of Regular Expression

After the definition of strings and languages, we are ready to describe regular expressions, the notation we shall use to define the class of languages known as regular sets. Recall that a token is either a single string (such as a punctuation symbol) or one of a collection of strings of a certain type (such as an identifier). If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens. In regular-expression notation we could write the definition for identifier as:-

Identifier = letter (letter | digit)*

The vertical bar means "or", that is, union; the parentheses are used to group subexpressions; and the star is the closure operator meaning "zero or more instances".

What we call the regular expressions over alphabet Σ are exactly those expressions that can be constructed from the following rules. Each regular expression denotes a language, and we give the rules for constructing the denoted languages along with the regular-expression construction rules.

1- ε is a regular expression denoting {ε}, that is, the language consisting only of the empty string.
2- For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3- If R and S are regular expressions denoting languages L_R and L_S, respectively, then:-
i) (R)|(S) is a regular expression denoting L_R ∪ L_S
ii) (R).(S) is a regular expression denoting L_R . L_S
iii) (R)* is a regular expression denoting (L_R)*

We have shown regular expressions formed with parentheses wherever possible. In fact, we eliminate them when we can, using the precedence rules that * has highest precedence, then comes ., and | has lowest precedence.

Let us assume that our alphabet Σ is {a, b}. The regular expression a denotes {a}, which is different from just the string a.

1- The regular expression a* denotes the closure of the language {a}, that is, a* = ∪_{i=0}^{∞} {a}^i, the set of all strings of zero or more a's. The regular expression aa*, which by our precedence rules is parsed a(a)*, denotes the strings of one or more a's. We may use a+ for aa*.
2- What does the regular expression (a|b)* denote? We see that a|b denotes {a, b}, the language with the two strings a and b. Thus (a|b)* denotes ∪_{i=0}^{∞} {a, b}^i, which is just the set of all strings of a's and b's, including the empty string. The regular expression (a*b*)* denotes the same set.
3- The expression a|ba* is grouped a|(b(a)*), and denotes the set of strings consisting of either a single "a" or a "b" followed by zero or more a's.
4- The expression aa|ab|ba|bb denotes all strings of length two, so (aa|ab|ba|bb)* denotes all strings of even length. Note that ε is a string of length zero.
5- ε|a|b denotes the strings of length zero or one.

Example: The tokens discussed in fig (5) can be described by regular expressions as follows:

Keyword = BEGIN | END | IF | THEN | ELSE
Identifier = letter (letter | digit)*
Constant = digit+
Relops = < | <= | = | <> | > | >=

where letter stands for A | B | ... | Z, and digit stands for 0 | 1 | ... | 9.

If two regular expressions R and S denote the same language, we write R = S, and say that R and S are equivalent. For example, we previously observed that (a|b)* = (a*b*)*. For any regular expressions R, S and T, the following axioms hold:-

1- R|S = S|R (| is commutative)
2- R|(S|T) = (R|S)|T (| is associative)
3- R(ST) = (RS)T (. is associative)
4- R(S|T) = RS|RT and (S|T)R = SR|TR (. distributes over |)
5- εR = Rε = R (ε is the identity for concatenation)

Finite Automata
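Before any automaton is drawn, the regular expressions above can be tried out with an existing regular-expression engine acting as a ready-made recognizer. The following is a minimal sketch in Python, assuming the token patterns are transcribed into the syntax of the standard `re` module; the names `TOKENS`, `classify` and `same_language` are illustrative and do not come from the text, and the constant pattern is assumed to mean one or more digits. The equivalence check only compares two expressions on all strings up to a fixed length, so it is a spot check and not a proof.

```python
import re
from itertools import product

# Token patterns transcribed into Python's `re` syntax (an assumption:
# the text uses abstract notation, not a particular regex dialect).
LETTER = "[A-Z]"
DIGIT = "[0-9]"
TOKENS = {
    "keyword": "BEGIN|END|IF|THEN|ELSE",
    "identifier": f"{LETTER}({LETTER}|{DIGIT})*",
    "constant": f"{DIGIT}+",  # assumed: one or more digits
    "relop": "<=|<>|<|=|>=|>",
}


def classify(lexeme):
    """Return the names of all token classes whose pattern matches the whole lexeme."""
    return [name for name, pat in TOKENS.items() if re.fullmatch(pat, lexeme)]


def same_language(p, q, alphabet="ab", max_len=6):
    """Compare two patterns on every string over `alphabet` up to `max_len`.

    This is a bounded spot check, not a proof of equivalence.
    """
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


print(classify("X25"))                      # ['identifier']
print(classify("BEGIN"))                    # ['keyword', 'identifier']
print(same_language("(a|b)*", "(a*b*)*"))   # True
print(same_language("a*", "(a|b)*"))        # False
```

Note that BEGIN matches both the keyword and the identifier pattern; the regular expressions alone do not say which class wins, so a lexical analyzer needs an extra rule (for example, checking keywords first).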
A recognizer for a language L is a program that takes as input a string x and answers "yes" if x is a sentence of L and "no" otherwise. Clearly, the part of a lexical analyzer that identifies the presence of a token on the input is a recognizer for the language defining that token.

Suppose we have specified a language by a regular expression R, and we are given some string x. We want to know whether x is in the language L denoted by R. One way to attempt this test is to check that x can be decomposed into a sequence of substrings denoted by the primitive subexpressions in R. Suppose R is (a|b)*abb, the set of all strings ending in abb, and x is the string aabb. We see that R = R1 R2, where R1 = (a|b)* and R2 = abb. We can verify that a is an element of the language denoted by R1 and that abb similarly matches R2. In this way, we show that aabb is in the language denoted by R.

Non Deterministic Automata

A better way to convert a regular expression to a recognizer is to construct a generalized transition diagram from the expression. This diagram is called a nondeterministic finite automaton (NFA). An NFA recognizing the language (a|b)*abb is shown in fig (7).

[Fig (7): A nondeterministic finite automaton, with start state 0 and states 1, 2 and 3]

The NFA is a labeled directed graph. The nodes are called states, and the labeled edges are called transitions. The NFA looks almost like a transition
diagram, but edges can be labeled by ε as well as by characters, and the same character can label two or more transitions out of one state. One state (0 in fig (7)) is distinguished as the start state, and one or more states may be distinguished as accepting states (or final states). State 3 in fig (7) is the final state.

The transitions of an NFA can be conveniently represented in tabular form by means of a transition table. The transition table for the NFA of fig (7) is shown in fig (8). In the transition table there is a row for each state and a column for each input symbol. The entry for row i and symbol a is the set of possible next states for state i on input a.

    State    Input symbol
             a        b
    0        {0,1}    {0}
    1        ----     {2}
    2        ----     {3}

    Fig (8) Transition Table

The NFA accepts an input string x if and only if there is a path from the start state to some accepting state such that the labels along that path spell out x. If the input string is aabb, then we can show this sequence of moves:-

    State    Remaining input
    0        aabb
    0        abb
    1        bb
    2        b
    3        ε
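The acceptance test just described can be sketched in Python. This is a minimal illustration, assuming the transition table of fig (8) is stored as a dictionary; the names `NFA_TABLE`, `nfa_accepts` and `accepting_path` are hypothetical, and the empty row for state 3 is an assumption, since the table lists no row for it. The first function tracks the set of all states reachable after each symbol; the second searches for one concrete path, which is what the sequence of moves above shows.

```python
# Transition table of fig (8): state -> {symbol -> set of next states}.
# A missing entry stands for the empty set ("----" in the table).
NFA_TABLE = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},  # assumed: the accepting state has no outgoing edges
}
START = 0
ACCEPTING = {3}


def nfa_accepts(x):
    """Accept x iff some path from START spelling x ends in an accepting state."""
    current = {START}
    for symbol in x:
        # Follow every edge labeled `symbol` out of every current state.
        current = {t for s in current for t in NFA_TABLE[s].get(symbol, set())}
    return bool(current & ACCEPTING)


def accepting_path(x, state=START):
    """Return one list of states that spells x and ends in an accepting state, or None."""
    if not x:
        return [state] if state in ACCEPTING else None
    for nxt in sorted(NFA_TABLE[state].get(x[0], set())):
        rest = accepting_path(x[1:], nxt)
        if rest is not None:
            return [state] + rest
    return None


print(nfa_accepts("aabb"))      # True
print(accepting_path("aabb"))   # [0, 0, 1, 2, 3]
print(nfa_accepts("abab"))      # False
```

The path search may have to back up after a wrong guess (for example, staying in state 0 on the second a), whereas the set-based version never backtracks because it follows all choices at once.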
In fig (9) below we can see an NFA that recognizes aa*|bb*. The string aaa is accepted by going through states 0, 1, 2, 2, and 2. The labels of these edges are ε, a, a, and a, whose concatenation is aaa.

[Fig (9): NFA accepting aa*|bb*, with start state 0, two ε-transitions out of state 0, and states 1, 2, 3 and 4]

Deterministic Automata

The NFA of fig (7), as its transition table in fig (8) shows, has more than one transition from state 0 on input a; that is, it may go to state 0 or 1. Similarly, the NFA of fig (9) has two transitions on ε from state 0. These situations are the reason why it is hard to simulate an NFA with a computer program. A deterministic finite automaton has at most one path from the start state labeled by any string. A finite automaton is deterministic if

1- It has no transitions on input ε.
2- For each state s and input symbol a, there is at most one edge labeled a leaving s.
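The two conditions above can be checked mechanically. The following is a minimal sketch in Python, assuming an automaton is given as a list of (state, label, next state) edges with the empty string standing for ε; the names `EPS`, `NFA_FIG7`, `NFA_FIG9` and `violations` are hypothetical. For the NFA of fig (9), the a-branch follows the path 0, 1, 2, 2, 2 given in the text, while the b-branch through states 3 and 4 is assumed to mirror it.

```python
from collections import Counter

EPS = ""  # hypothetical marker for an ε-labeled edge

# Edges as (state, label, next_state) triples.
NFA_FIG7 = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2), (2, "b", 3)]
# Assumed shape of fig (9): the b-branch mirrors the a-branch.
NFA_FIG9 = [(0, EPS, 1), (1, "a", 2), (2, "a", 2),
            (0, EPS, 3), (3, "b", 4), (4, "b", 4)]


def violations(edges):
    """List the reasons an automaton is not deterministic (empty list = deterministic)."""
    counts = Counter((state, label) for state, label, _ in edges)
    problems = []
    for (state, label), n in sorted(counts.items()):
        if label == EPS:
            # Condition 1: no transitions on ε.
            problems.append(f"state {state}: transition on ε")
        elif n > 1:
            # Condition 2: at most one edge per (state, symbol) pair.
            problems.append(f"state {state}: {n} edges labeled {label}")
    return problems


print(violations(NFA_FIG7))   # ['state 0: 2 edges labeled a']
print(violations(NFA_FIG9))   # ['state 0: transition on ε']
```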
Example: In fig (10) below we see a deterministic finite automaton (DFA) accepting the language (a|b)*abb, which is the same language as that accepted by the NFA of fig (7).

[Fig (10): DFA accepting (a|b)*abb, with start state 0 and states 1, 2 and 3]

Since there is at most one transition out of any state on any symbol, a DFA is easier to simulate by a program than an NFA.

How to Build a Lexical Analyzer

Step1 Convert the Grammar into a Transition Diagram.
Step2 Convert the Regular Expression into a Nondeterministic Finite State Automaton.
Step3 Convert the NFA into a DFA.
Step4 Minimize the Finite State Automaton.
Step5 Write an efficient program for the minimized finite state automaton, called the (minimized finite state automaton recognizer).
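Steps 3 and 5 can be illustrated together. The sketch below, in Python, assumes the NFA for (a|b)*abb from fig (7) and uses the subset construction, a standard technique that the steps above do not name explicitly, to obtain a DFA whose states are sets of NFA states. It assumes the NFA has no ε-transitions, and it skips Steps 1, 2 and 4. All names (`subset_construction`, `dfa_accepts`, and so on) are hypothetical.

```python
from collections import deque

# NFA of fig (7) for (a|b)*abb: state -> {symbol -> set of next states}.
NFA = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}, 3: {}}
NFA_START, NFA_ACCEPTING = 0, {3}
ALPHABET = "ab"


def subset_construction(nfa, start, accepting, alphabet):
    """Step 3: build a DFA whose states are sets of NFA states (no ε-edges assumed)."""
    start_set = frozenset({start})
    table = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in table:
            continue  # already expanded
        table[current] = {}
        for symbol in alphabet:
            # All NFA states reachable from any member of `current` on `symbol`.
            target = frozenset(t for s in current for t in nfa[s].get(symbol, ()))
            table[current][symbol] = target
            if target not in table:
                queue.append(target)
    dfa_accepting = {state for state in table if state & accepting}
    return table, start_set, dfa_accepting


def dfa_accepts(table, start, accepting, x):
    """Step 5: one table lookup per input symbol, with no sets and no backtracking."""
    state = start
    for symbol in x:
        state = table[state][symbol]  # raises KeyError for symbols outside the alphabet
    return state in accepting


DFA_TABLE, DFA_START, DFA_ACCEPTING = subset_construction(
    NFA, NFA_START, NFA_ACCEPTING, ALPHABET
)

print(len(DFA_TABLE))                                            # 4
print(dfa_accepts(DFA_TABLE, DFA_START, DFA_ACCEPTING, "aabb"))  # True
print(dfa_accepts(DFA_TABLE, DFA_START, DFA_ACCEPTING, "abba"))  # False
```

The resulting DFA has four states, {0}, {0,1}, {0,2} and {0,3}, and the recognizer loop does a constant amount of work per input symbol, which is the practical reason for converting the NFA first.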