Compiler Design: Lexical Analysis
What is Lexical Analysis? It is the phase where the compiler reads the source program text from the input device. [Diagram: source program → lexical analyzer → token / get-next-token ↔ parser, both consulting the symbol table.] Reading has to be done character by character, but buffering helps.
What is Lexical Analysis? It detects valid tokens, which are the equivalent of words in a text. For example, from the following code segment: if xval <= yval then result := yval — the lexical analyzer will detect these unbreakable, meaningful components: if, xval, <=, yval, then, result, :=, yval
What is Lexical Analysis? It checks whether these meaningful components, which are like words, are valid from the point of view of the given language. If a component is valid according to the language, the lexical analyzer determines its type, or token. In the example, tokens are identified as follows: if → keyword, <= → relational operator, xval → identifier, := → assignment operator, yval → identifier, result → identifier
What is Lexical Analysis? Question: what is a valid token? Answer: there will be set rules defining valid tokens. For example, the usual rule for an identifier is: it starts with a letter, followed by any combination of letters and digits (possibly empty). So xval is a valid identifier, but 9xval is not. In fact, 9xval is generally not valid as anything: neither keyword, nor operator, nor numeric value.
Valid Tokens. Valid operators are fixed strings: =, <, <=, >, >=. Valid numerics follow a pattern: starts with a digit, followed by more digits; there may be a decimal point, with digits after it; then there may be an E (exponent) followed by a signed or unsigned integer. Examples: 12.34, 12.34E10, 12.34E-5
File Reading is Overhead. This is the most expensive phase of the compiler, because it reads text from a device: an extensive input operation. Although the input is processed character by character while matching the set patterns, it is better to read a block (say 1024 bytes) at a time, place it in a buffer, and then process from the buffer. Why? Every read involves a system call, and therefore a context switch. One system call per 1024 characters saves far more time than one per character.
File Reading is Overhead. Without buffering, the following happens in a loop for each character: 1. Read a character from the disk (system call). 2. Compare it with a pattern (user mode). 3. Change state (user mode). 4. Go to 1. So for every input character read there is a system call, meaning the context changes from user to system; after the read, the context changes from system back to user.
File Reading is Overhead. Context change being an overhead, we incur two such overheads for every input character. If we could read, say, 1024 characters at once through a single system call, we would cut the context-switch cost by a factor of 1024.
Lexical analyzer tasks, looping over steps 1-2 until end of file: 1. Read a character from disk. THIS IS A SYSTEM CALL: the CPU changes context to the OS, saving the PCB of the user process. 2. Match the pattern: the CPU goes back to user mode, changing context again, saving the OS state and restoring the PCB of the user process. So, for a file of 10,000 characters, a context change takes place 20,000 times.
[Diagram: per-character reading. Every "read a character" call crosses from the user process to the OS and back, so each pattern match is bracketed by two context switches: 2048 context switches for 1024 characters.]
[Diagram: block reading. The user process asks the OS to read 1024 characters at once, then pattern-matches them all from the buffer: only 2 context switches instead of 2048.]
How it happens using a buffer: [Animation: the buffer holds newval=oldval*12. The Base pointer (BP) stays on the first character n while the Forward pointer (FP) advances one character at a time over e, w, v, a, l.] When FP reaches =, which cannot be part of an identifier, FP retracts one position and Gettoken() returns: the string between BP and FP is the next token (if it adheres to some rule). Then BP is sprung forward to FP, and the next token can be found.
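The Base/Forward two-pointer scan above can be sketched in Python. This is a minimal illustration, not a full analyzer: the token set (identifiers, unsigned integers, single-character operators) is an assumption chosen to match the slide's example string.

```python
# A minimal sketch of the Base/Forward (BP/FP) scan described above.
# The token classes are an assumption: identifiers, integers, 1-char operators.
def tokenize(buf):
    tokens, base = [], 0                          # base = BP
    while base < len(buf):
        fwd = base                                # fwd = FP
        if buf[fwd].isalpha():                    # identifier: letter (letter|digit)*
            while fwd < len(buf) and buf[fwd].isalnum():
                fwd += 1                          # FP advances past the token; the
            tokens.append(("id", buf[base:fwd]))  # "retract" is implicit in the slice
        elif buf[fwd].isdigit():                  # unsigned integer: digit digit*
            while fwd < len(buf) and buf[fwd].isdigit():
                fwd += 1
            tokens.append(("num", buf[base:fwd]))
        else:                                     # anything else: 1-char operator
            fwd += 1
            tokens.append(("op", buf[base:fwd]))
        base = fwd                                # BP is sprung forward to FP
    return tokens

print(tokenize("newval=oldval*12"))
# [('id', 'newval'), ('op', '='), ('id', 'oldval'), ('op', '*'), ('num', '12')]
```

The string between BP and FP at the moment of retraction is exactly the slice `buf[base:fwd]`, which is why no explicit retract step is needed here.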
Okay: input characters are read in blocks and put in a buffer (an array in main memory), and the characters are scanned one by one. It follows that when the last character in the buffer has been read and processed, the buffer must be reloaded. Consider a scenario where the buffer ends before the variable/identifier oldval is complete: [Buffer: newval = old — BP on the o of old, FP at the end of the buffer.]
Obviously, the next block has to be read into the buffer, and as a result the current buffer is overwritten: [Buffer: val * 12.] The previous content of the buffer is lost, and BP no longer points to the earlier content.
Way out: two buffers, or a split buffer, reloaded alternately. Initially, both buffers are empty.
Read the first block into the first buffer: [Buffer 1: newval = old | Buffer 2: empty]
Keep scanning and processing until FP reaches the last character of the first buffer. That means the lexical analyzer is inside the last potential token: [Buffer 1: newval = old — BP on the o of old, FP at the end of buffer 1.]
Read the next block and reload the second buffer: [Buffer 1: newval = old | Buffer 2: val * 12.]
Move the forward pointer by one character; it thereby enters the second buffer. Continue processing as if it were a single buffer: [Buffer 1: newval = old | Buffer 2: val * 12 — BP still on the o in buffer 1, FP now in buffer 2.]
Necessity of buffering (A): to read a block of characters at a time, reducing both context-switching time and I/O time. Otherwise, the process would be: Loop: issue an I/O request to read the next character /* system call */; execute the processing logic /* user mode */; go back to Loop. That is 1024 system calls for reading 1024 characters, plus a separate request to the disk controller for each character. With block reads and buffering, one system call and one I/O request are issued per block, for example per 1024 characters.
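The block-read-then-process pattern can be sketched as follows. The file name and the demo at the bottom are illustrative assumptions; the point is that `read(BLOCK)` issues one system call per block, after which scanning proceeds entirely in user mode from the in-memory buffer.

```python
BLOCK = 1024  # block size from the slides; any power of two works

def chars(path):
    """Yield the file's characters, one read() system call per BLOCK bytes."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(BLOCK)   # one system call fills the whole buffer
            if not buf:
                return            # end of file
            for b in buf:         # pattern matching then consumes the buffer
                yield chr(b)      # character by character, in user mode

# Illustrative usage with a scratch file (file name is an assumption).
with open("demo.src", "w") as f:
    f.write("newval=oldval*12")
print("".join(chars("demo.src")))   # newval=oldval*12
```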
Necessity of buffering (B): sometimes the lexical analyzer has to look ahead in order to identify a token. Example: two similar-looking FORTRAN statements, with their meanings. (i) DO 5 I = 1.25 /* DO5I is a variable; set its value to 1.25. FORTRAN allows spaces inside variable names. */ (ii) DO 5 I = 1,25 /* Execute up to line 5 for I = 1 to 25 */ The only difference between the two is the '.' versus ',' between 1 and 25. The lexical analyzer must read forward (look ahead) with FP, holding BP in place, to detect the presence of a ',' or '.'; after deciding, FP is retracted.
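The look-ahead decision for the FORTRAN example can be sketched as below. This is a deliberately simplified assumption (real FORTRAN lexing is more involved): after the '=', scan forward to see whether a ',' or a '.' comes first.

```python
# A simplified sketch of the FORTRAN DO-statement look-ahead above.
# The function name and classification labels are illustrative assumptions.
def classify_do(stmt):
    s = stmt.replace(" ", "")        # FORTRAN ignores spaces in statements
    rest = s.split("=", 1)[1]        # look ahead past the '='
    for ch in rest:
        if ch == ",":
            return "DO loop"         # DO 5 I = 1,25 : loop header
        if ch == ".":
            return "assignment"      # DO5I = 1.25   : variable assignment
    return "assignment"

print(classify_do("DO 5 I = 1.25"))  # assignment
print(classify_do("DO 5 I = 1,25"))  # DO loop
```

Notice that the decision about the very first token (keyword DO versus the start of identifier DO5I) cannot be made until the scanner has looked far ahead of BP, which is exactly why the look-ahead span must stay resident in the buffer.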
Limitations of the buffer pair: (1) The look-ahead character may be beyond the end of the buffer. For example, in PL/1: DECLARE (ARG1, ARG2, … ARGn) — to determine whether DECLARE is an array name or a keyword, the analyzer has to look ahead to the closing ')'. (2) With every character scanned, the lexical analyzer has to check for end of block. Since there are two buffers, it checks twice: is this the end of buffer 1, or of buffer 2? The algorithm and its alternative are furnished in the following two slides.
Algorithm: with every character read, the program checks for end of buffer. [Flowchart:] If FWD = end of 1st half: reload the 2nd half and advance FWD. Else, if FWD = end of 2nd half: reload the 1st half and move FWD to the beginning of the 1st half. Else: FWD := FWD + 1. Back to the loop. Note the double check on every character: first against the end of buffer 1, then against the end of buffer 2.
Alternative (more efficient): use a sentinel value for end of buffer, by putting a $ at the end of each buffer half. [Flowchart:] If FWD ≠ $: simply advance, back to the loop. If FWD = $, it must be the end of a half: if it is the end of the 1st half, reload the 2nd half; else, if FWD = end of the 2nd half, reload the 1st half and move FWD to its beginning. Though this looks like two checks, the second check runs only when FWD = $, which happens once per buffer half. If each half holds 1024 characters, then 1023 times out of 1024 the second check does not happen.
Specification of Tokens. Regular expressions are used for recognizing patterns. Let us see how a regular pattern is represented. First of all, a single character is a regular language in itself. (1) Union of regular languages: the symbol A is itself a regular language, and likewise B, C, and so on. Therefore {A, B, C, …, Z} is regular, since it is the union of regular languages.
Specification of Tokens. Say L = {A, B, C, …, Z, a, b, c, …, z} /* all the letters */ and D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} /* all the digits */. Union: L ∪ D = {A, …, Z, a, …, z, 0, …, 9} is also a regular language, representing the set of all letters and digits. (2) Concatenation: L.D consists of each element of L concatenated with each element of D, giving the set {A0, A1, A2, …, A9, B0, B1, B2, …, B9, …, z9}, which is also regular.
Specification of Tokens. (3) Exponent / Kleene closure: L is regular, so the concatenation L.L, which can be written as L^2, is also regular. Similarly L^3, L^4, L^5, and so on are all regular. Moreover, ε is regular and L^0 = {ε}. Therefore L^0 ∪ L^1 ∪ L^2 ∪ L^3 ∪ L^4 ∪ … — any number of repetitions — is regular. This is called the Kleene closure, written L*.
Specification of Tokens. (4) Transpose / reverse: the reverse of a regular string is also regular, and the set of reverses of all strings in a regular language is also regular. So, {A0, A1, A2} being regular, {0A, 1A, 2A} is also regular.
Specification of Tokens: Regular Expressions. A regular language is represented by a regular expression. As mentioned before, a single character is a regular expression, and as with regular languages, union, concatenation, Kleene closure, and reversal of regular expressions again yield regular expressions. So, if R1 and R2 are two regular expressions, then so are: R1+R2 (alternatively written R1|R2), R1.R2, R1*, and any combination of these three.
Specification of Tokens: Regular Expressions. Example: identifiers in Pascal. An identifier starts with a letter, followed by any number of letters and digits in any order. Letter → A|B|C|…|Z|a|b|c|…|z; Digit → 0|1|2|3|4|5|6|7|8|9; Id → Letter.(Letter|Digit)*. Note: (1) Letter and Digit have to be defined before Id. (2) Character-class notation can also be used to declare Letter and Digit: Letter → [A-Za-z], Digit → [0-9].
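The Pascal identifier rule above translates directly into Python's `re` syntax using the character-class notation just mentioned; a small sketch:

```python
import re

# The identifier rule above: Letter (Letter | Digit)*
LETTER = "[A-Za-z]"
DIGIT = "[0-9]"
ID = re.compile(f"^{LETTER}({LETTER}|{DIGIT})*$")

print(bool(ID.match("xval")))    # True  : letter then letters/digits
print(bool(ID.match("x9val")))   # True  : digits allowed after the first letter
print(bool(ID.match("9xval")))   # False : must start with a letter
```

Note how the definitions of LETTER and DIGIT precede ID, exactly as the slide requires.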
Specification of Tokens: Regular Expressions. Example: floating-point numbers. Rule: digits, optionally followed by a '.' and more digits, optionally followed by E and a signed or unsigned integer. Digit → [0-9]; Digits → Digit.Digit*; Optional_Fraction → '.' Digits | ε; Optional_Exponent → (E (+|-|ε) Digits) | ε; Floating_point_number → Digits.Optional_Fraction.Optional_Exponent
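The floating-point grammar above composes into a single Python regular expression, with each named nonterminal becoming a named fragment; a sketch:

```python
import re

# Each grammar rule above becomes one regex fragment.
DIGITS = r"[0-9]+"                    # Digits            -> Digit Digit*
OPT_FRACTION = rf"(\.{DIGITS})?"      # Optional_Fraction -> '.' Digits | eps
OPT_EXPONENT = rf"(E[+-]?{DIGITS})?"  # Optional_Exponent -> E (+|-|eps) Digits | eps
FLOAT = re.compile(f"^{DIGITS}{OPT_FRACTION}{OPT_EXPONENT}$")

for s in ["12", "12.34", "12.34E10", "12.34E-5", ".5", "12E"]:
    print(s, bool(FLOAT.match(s)))
# 12 True / 12.34 True / 12.34E10 True / 12.34E-5 True / .5 False / 12E False
```

The two rejections match the grammar: '.5' fails because Digits must come before the optional fraction, and '12E' fails because an E must be followed by digits.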