CS 403 Compiler Construction
Lecture 3: Lexical Analysis
[Based on Chapters 1, 2, 3 of Aho2]

What is Lexical Analysis?
- First step of a compiler.
- Reads (scans) the characters of the program and groups them into tokens.
- Tokens are of the form <token-name, attribute-value> or <token-name>.
- Lexeme: the actual character sequence in the program that is an instance of a token.
- Example 1: a=b+c becomes <id,1> <=> <id,2> <+> <id,3>
  5 lexemes (a, =, b, +, c), 5 tokens.
- Tokens are stored in a symbol table.
What is Lexical Analysis? (continued)
- Example 2: Position = initial + rate * 60 becomes
  <id,1> <=> <id,2> <+> <id,3> <*> <60>
- White space is deleted during lexical analysis, so the token stream contains no white space.

How to identify tokens?
- By finding patterns, because tokens have many different patterns.
- Patterns: the different forms that tokens can take.
- Examples of patterns:
  Pattern 1: keywords: if, else, system, out, in, ...
  Pattern 2: operators: +, -, %, *, >=, <>, ==, <=, ...
  Pattern 3: variables: a, xyz, a_b, p, q2, _e, _001abc, ...
  Pattern 4: numbers: 23, 3.45, -7, 0, ...
  Other patterns ...
- Example: if (speed > 130) system.out.println("Ticket 300 SAR");
  This Java code has many token patterns: if, speed, >, 130, system, ., out, println, (, ", Ticket 300 SAR, ", ), ;
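The two examples above (a=b+c and Position = initial + rate * 60) can be reproduced with a small scanner. The following is only a minimal sketch under stated assumptions: the names tokenize, intern, symbols, MAX_SYMBOLS and MAX_NAME are hypothetical, every operator is assumed to be a single character, and a number is assumed to be a plain run of digits.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 32   /* assumed capacity of the symbol table */
#define MAX_NAME    32   /* assumed maximum lexeme length + 1    */

/* Hypothetical symbol table: entry i holds the (i+1)-th distinct identifier. */
static char symbols[MAX_SYMBOLS][MAX_NAME];
static int symbol_count = 0;

/* Return the 1-based index of name, adding it if it is new (0 if the table is full). */
static int intern(const char *name) {
    for (int i = 0; i < symbol_count; i++)
        if (strcmp(symbols[i], name) == 0)
            return i + 1;
    if (symbol_count == MAX_SYMBOLS)
        return 0;
    strcpy(symbols[symbol_count], name);
    return ++symbol_count;
}

/* Scan src and write the token stream, separated by spaces, into out. */
void tokenize(const char *src, char *out, size_t out_size) {
    size_t used = 0;
    symbol_count = 0;
    out[0] = '\0';
    while (*src != '\0') {
        char lexeme[MAX_NAME];
        char tok[MAX_NAME + 16];
        int n = 0;
        if (isspace((unsigned char)*src)) {          /* white space is deleted */
            src++;
            continue;
        }
        if (isalpha((unsigned char)*src) || *src == '_') {   /* identifier */
            while (isalnum((unsigned char)*src) || *src == '_') {
                if (n < MAX_NAME - 1) lexeme[n++] = *src;
                src++;
            }
            lexeme[n] = '\0';
            snprintf(tok, sizeof tok, "<id,%d>", intern(lexeme));
        } else if (isdigit((unsigned char)*src)) {           /* number */
            while (isdigit((unsigned char)*src)) {
                if (n < MAX_NAME - 1) lexeme[n++] = *src;
                src++;
            }
            lexeme[n] = '\0';
            snprintf(tok, sizeof tok, "<%s>", lexeme);
        } else {                                             /* one-character operator */
            snprintf(tok, sizeof tok, "<%c>", *src);
            src++;
        }
        size_t len = strlen(tok);
        if (used + len + 2 > out_size) break;        /* no room left in out */
        if (used > 0) out[used++] = ' ';
        memcpy(out + used, tok, len + 1);
        used += len;
    }
}
```

Calling tokenize on the string of Example 2 with a 128-byte buffer fills the buffer with the same token stream as on the slide. A real scanner would also handle multi-character operators such as >= and keywords, which this sketch leaves out.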
Regular Expressions
- How to identify token patterns? Answer: by regular expressions.
- Regular expressions [we already learned this in CS 301: Theory of Computation]:
  ε is a regular expression denoting {ε}, the set that contains only the empty string.
  Every symbol a is a regular expression denoting {a}.
  If r1 and r2 are two regular expressions, then:
    r1* denotes zero or more occurrences of r1
    r1+ denotes one or more occurrences of r1
    r1 r2 denotes concatenation (r1 followed by r2)
    r1 | r2 denotes either r1 or r2
- Example: regular expression for integers
  Suppose that, in a programming language, integers look like these: 2, 999, -50, +34, -00023, +0, -0, +000
  So, the regular expression for them is:
    integer → (+ | - | ε) (0 | 1 | 2 | ... | 9)+
- Example: regular expression for decimals
  Suppose that decimals look like these: 0.0, 003.922, +4.001, -44.000
  So, the regular expression for them is:
    decimal → integer . (0 | 1 | 2 | ... | 9)+

Regular Expressions (continued)
- Example: regular expression for identifiers of the C language
  Identifiers (also called variables) in C look like these (similar to Java): a, a_1, a2, _, _p, y98, Masud_Hasan, ...
  Identifiers are used in statements like this: a = a + 1; _p = b*a_1;
  But -55, 23masud, 0, 2, ... are not identifiers.
  An identifier must start with a letter or _; after that, letters, _, and digits can repeat.
  So, the regular expression for an identifier is:
    id → (letter | _) (letter | _ | digit)*
  But we also need to say what a letter is and what a digit is. The complete regular expression is:
    letter → A | B | C | ... | Z | a | b | c | ... | z
    digit → 0 | 1 | 2 | ... | 9
    id → (letter | _) (letter | _ | digit)*
- Example: regular expression for white space
    ws → (blank | tab | newline)+
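The three definitions above (integer, decimal, id) can be checked by hand-written matching functions. This is a sketch, not a regular expression engine: the function names is_integer, is_decimal, is_identifier and the helpers digits1 and match_integer are hypothetical, and letter is assumed to mean what isalpha accepts in the default C locale (A to Z, a to z).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* digit+ : skip one or more digits; return the position after them, or NULL if there is none. */
static const char *digits1(const char *s) {
    if (!isdigit((unsigned char)*s))
        return NULL;
    while (isdigit((unsigned char)*s))
        s++;
    return s;
}

/* integer -> (+ | - | empty) digit+ ; returns the position after the match, or NULL. */
static const char *match_integer(const char *s) {
    if (*s == '+' || *s == '-')
        s++;                      /* the sign is optional */
    return digits1(s);
}

/* True if the whole string s is an integer. */
bool is_integer(const char *s) {
    const char *end = match_integer(s);
    return end != NULL && *end == '\0';
}

/* decimal -> integer . digit+ */
bool is_decimal(const char *s) {
    const char *end = match_integer(s);
    if (end == NULL || *end != '.')
        return false;
    end = digits1(end + 1);
    return end != NULL && *end == '\0';
}

/* id -> (letter | _) (letter | _ | digit)* */
bool is_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return false;             /* must start with a letter or _ */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return false;
    return true;
}
```

With these functions, -00023 and +0 are integers, 003.922 is a decimal, _p and Masud_Hasan are identifiers, and 23masud and -55 are not identifiers, which agrees with the examples on the slides.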
Transition Diagrams
- Transition diagrams: an intermediate step after regular expressions. They are pictorial, like a graph, and an easier way to understand patterns.
- Acceptor (says yes if the input matches).
- Example: transition diagram for relational operators (relop)
  Relational operators (relop) in Pascal are: =, <, >, <>, <=, >=
  [Figure: transition diagram for relop]
  - Circle means state
  - Double circle means accepting state
- Equivalent regular expression:
    relop → < | > | = | <> | <= | >=

Finite Automata (learned in CS 301)
- Finite automata are graphs, like transition diagrams.
- But they are recognizers: they say yes or no.
- If the string is finished and the automaton is in a final state, then YES.
- If the string is finished and the automaton is not in a final state, or it is in a final state but the string is not finished, then NO.
- Two types: NFA (nondeterministic) and DFA (deterministic). They are equivalent: both recognize exactly the same languages.
- For any regular expression, there are an equivalent NFA and an equivalent DFA.
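A transition diagram such as the one for relop can be turned into code almost directly: one case per state, one test per outgoing edge. The sketch below assumes its own state numbering (0 = start, 1 = after <, 2 = after >), because the numbering of the original figure is not available here; the names Relop and match_relop are hypothetical.

```c
/* Token names for the six Pascal relational operators. */
typedef enum { RELOP_NONE, LT, LE, NE, EQ, GT, GE } Relop;

/*
 * Follow the diagram starting at s[0].
 * On success, return the operator and store the lexeme length in *len.
 * If s does not start with a relop, return RELOP_NONE and set *len to 0.
 */
Relop match_relop(const char *s, int *len) {
    int state = 0;   /* assumed numbering: 0 = start, 1 = seen '<', 2 = seen '>' */
    int i = 0;
    for (;;) {
        char c = s[i];
        switch (state) {
        case 0:
            if (c == '<')      { state = 1; i++; }
            else if (c == '=') { *len = 1; return EQ; }
            else if (c == '>') { state = 2; i++; }
            else               { *len = 0; return RELOP_NONE; }
            break;
        case 1:                               /* seen '<' */
            if (c == '=') { *len = 2; return LE; }
            if (c == '>') { *len = 2; return NE; }
            *len = 1;                         /* other character: accept "<" only */
            return LT;
        case 2:                               /* seen '>' */
            if (c == '=') { *len = 2; return GE; }
            *len = 1;                         /* other character: accept ">" only */
            return GT;
        }
    }
}
```

In states 1 and 2 the function looks at the next character but does not consume it when it is not part of the operator, so for the input <5 it reports LT with length 1 and the 5 is left for the next token.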
Example: NFA for (a | b)*abb
[Figure: NFA for (a | b)*abb]
- Try some examples for this NFA:
  YES: aaaabb, ababababb, abb, ...
  NO: a, bb, b, abab, abbbb, ...

Example: NFA for aa* | bb*
[Figure: NFA for aa* | bb*]
- ε means a transition without any input. You can add ε anywhere, any number of times.
- Try some examples for this NFA:
  YES: b, a, aa, bb, aaa, bbb, ... For example, εa = a, so YES.
  NO: ab, bba, ba, abab, ...
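The YES/NO examples for the first NFA, (a | b)*abb, can be checked by simulating the NFA: keep the set of all states the NFA could be in, and update that set for each input character. This is a sketch that assumes the usual four-state NFA for this expression (state 0 loops on a and b, then a, b, b lead through states 1, 2, 3, and state 3 is accepting); the numbering in the original figure may differ. The names move_on and nfa_accepts are hypothetical.

```c
#include <stdbool.h>

#define NUM_STATES 4

/* move_on[q][0] is the set of states reachable from q on 'a', move_on[q][1] on 'b'.
   Each set is a bitmask: bit k is set when state k is in the set. */
static const unsigned move_on[NUM_STATES][2] = {
    { (1u << 0) | (1u << 1), (1u << 0) },   /* state 0: a -> {0,1}, b -> {0} */
    { 0,                     (1u << 2) },   /* state 1: b -> {2}             */
    { 0,                     (1u << 3) },   /* state 2: b -> {3}             */
    { 0,                     0         },   /* state 3: accepting, no moves  */
};

/* Return true if the NFA says YES for the whole string s. */
bool nfa_accepts(const char *s) {
    unsigned current = 1u << 0;              /* start in state 0 only */
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b')
            return false;                    /* symbol outside the alphabet */
        int sym = (*s == 'b');               /* 0 for 'a', 1 for 'b' */
        unsigned next = 0;
        for (int q = 0; q < NUM_STATES; q++)
            if (current & (1u << q))
                next |= move_on[q][sym];     /* union of all possible moves */
        current = next;
    }
    return (current & (1u << 3)) != 0;       /* YES if state 3 is possible at the end */
}
```

This NFA has no ε transitions, so no ε-closure step is needed. Simulating the second NFA (aa* | bb*) would additionally require following ε edges before and after each move.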
DFA
- Example: NFA for (a | b)*abb (from the previous slide)
  [Figure: NFA for (a | b)*abb]
- Equivalent DFA for (a | b)*abb
  [Figure: DFA for (a | b)*abb]
- NFAs are easier to construct. DFAs are more difficult to construct.
- Try some examples for this DFA:
  YES: aaaabb, ababababb, abb, ...
  NO: a, bb, b, abab, ...

Write Program for Lexical Analyzer
- The lexical analyzer itself is a program. Actually, the whole compiler is a program.
- However, this program must be written in a language that already exists.
- For example, if you want to write a new programming language D now, in the year 2016, then its lexical analyzer (and also the other parts of the compiler of D) must be written in a language that is available now, such as C, C++, Java, Python, etc.
- The algorithm for the lexical analyzer program is based on the NFA or DFA.
- The program scans the input program (the D language program), identifies the tokens, and puts them in the symbol table.
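A program based on a DFA is usually a table lookup in a loop. The sketch below assumes the standard four-state DFA for (a | b)*abb, where state k means "the last characters read match the first k characters of abb" and state 3 is accepting; the state numbers in the original figure may differ. The names dfa_next and dfa_accepts are hypothetical.

```c
#include <stdbool.h>

/* dfa_next[state][0] is the next state on 'a', dfa_next[state][1] on 'b'. */
static const int dfa_next[4][2] = {
    { 1, 0 },   /* state 0: nothing matched yet */
    { 1, 2 },   /* state 1: last character was a */
    { 1, 3 },   /* state 2: last characters were ab */
    { 1, 0 },   /* state 3: last characters were abb (accepting) */
};

/* Return true if the DFA says YES for the whole string s. */
bool dfa_accepts(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b')
            return false;                     /* symbol outside the alphabet */
        state = dfa_next[state][*s == 'b'];   /* exactly one next state */
    }
    return state == 3;
}
```

Unlike an NFA simulation, the DFA is in exactly one state at any time, so each input character costs a single table lookup.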
Overall Picture of Lexical Analyzer of a Language like C, C++, Java, ...
- Identify all possible patterns in the language: letter, digit, number, keywords, arithmetic operators, logical operators, ...
- For each pattern, write a regular expression.
- For each pattern, construct an NFA or DFA. Draw transition diagrams for convenience and as an intermediate step. Combine them in one diagram.
- Write the program for the lexical analyzer accordingly, in an existing, different language, to identify the tokens.
- Take the input program (C or C++ or Java ...) as a string.
- Run the lexical analyzer program to identify the tokens and store them in the symbol table.

Example: Overall Picture of Lexical Analyzer of a Language
- Suppose that we want to write a new programming language D. It has only two keywords: KSA and 123.
- We want to write its compiler in C. As a first step, we write only the lexical analyzer.
- The two regular expressions are: KSA and 123.
- The combined regular expression is: KSA | 123.
- NFA:
  [Figure: two NFAs, one with the path Start, K, S, A and one with the path Start, 1, 2, 3]
- Combine:
  [Figure: one NFA with a single Start state and both paths, K, S, A and 1, 2, 3]
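Example: Overall Picture of Lexical Analyzer of a Language (continued)
- Suppose that the input D program is stored in the string input.
- The following is sample C code for the lexical analyzer of D:

    /* input: the D program as a string; symbol_table: an array of strings */
    if (input[0] == 'K' && input[1] == 'S' && input[2] == 'A')
        symbol_table[0] = "KSA";    /* path K, S, A of the NFA */
    else if (input[0] == '1' && input[1] == '2' && input[2] == '3')
        symbol_table[0] = "123";    /* path 1, 2, 3 of the NFA */

- As written, this code recognizes only the first keyword of the input. To recognize a whole program, the same checks are repeated at each following position.

Example 1: my_program_1.d contains KSA123123
  Symbol Table
  0  KSA
  1  123
  2  123

Example 2: my_program_2.d contains KSS
  Symbol Table
  Empty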
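To produce the symbol tables of Example 1 and Example 2, the keyword checks have to run in a loop over the whole input. The following is a minimal sketch under stated assumptions: the function name scan_d and its parameters are hypothetical, the input is assumed to contain no white space, and scanning is assumed to stop at the first position where neither keyword matches.

```c
#include <stddef.h>
#include <string.h>

/*
 * Scan the D program in input and store each keyword found in symbol_table.
 * Returns the number of entries stored (at most capacity).
 */
int scan_d(const char *input, const char *symbol_table[], int capacity) {
    int count = 0;
    size_t pos = 0;
    while (input[pos] != '\0' && count < capacity) {
        if (strncmp(input + pos, "KSA", 3) == 0)
            symbol_table[count++] = "KSA";
        else if (strncmp(input + pos, "123", 3) == 0)
            symbol_table[count++] = "123";
        else
            break;          /* neither keyword starts here: stop (assumed behavior) */
        pos += 3;           /* both keywords are 3 characters long */
    }
    return count;
}
```

For the input KSA123123 the function stores KSA, 123, 123 and returns 3. For the input KSS it returns 0 and the symbol table stays empty, as in Example 2. strncmp stops at the end of the string, so a short input such as KS is not read past its end.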