Principles of Software Engineering and Operational Systems Languages and Compilers SDAGE: Level I 2012-13 4. Lexical Analysis (Scanning) Dr Valery Adzhiev vadzhiev@bournemouth.ac.uk Office: TA-121 For some images: Copyright 2009 Elsevier, Inc. All rights reserved
Contents Lexical Analysis and Scanner Functionality Tokens and Their Specifics Scanner Implementation: Ad-Hoc Direct-coded Pure DFA Table-Driven DFA Lex: Scanner Generator Check Your Understanding 2
Lexical / Syntax Analysis Together, the Scanner and Parser are responsible for discovering the syntactic structure of the program. The Scanner's principal job: to reduce the quantity and complexity of information that must be processed by the Parser. The Parser is in control of recognising the syntactic structure: the Scanner is called by the Parser whenever it needs the next token Separating Lexical and Syntax Analysis allows for: Better efficiency in both phases Portability: parts of the lexical analyzer may not be portable, but the parser usually is. While reading input files, the lexical analyzer buffers input, which is platform-dependent; the syntax analyzer is platform-independent 3
Scanner Functionality Main function - tokenising : Aggregates characters into substrings to form words (lexemes) Applies a set of rules describing the lexical structure (microsyntax) to determine whether each word (lexeme) is valid (i.e. matches a pattern) If it is valid, the Scanner assigns it a syntactic category, thus recognising a token - the smallest meaningful language entity If not, a lexical error is reported. Saves tokens with source locations (file, line, column) to make it easier to generate error messages in subsequent phases Saves the text of interesting tokens (identifiers, strings, numerical literals, etc.) Removes comments Often deals with pragmas (i.e. significant comments ) 4
Dealing with Special Tokens Handling keywords (reserved words): Treat them as exceptions to the rule for identifiers: before returning an id, the scanner looks it up in a special hash table to make sure it's not a keyword Near-universal rule: always try to recognise the longest possible token from the input, which means you return only when the next character can't be used to continue the current token: foobar, not f or foob; 3.14159 is a real constant, never 3, ., and 14159 White space (blanks, tabs, newlines) is generally ignored, except to the extent that it separates tokens (so foo bar is different from foobar) In some cases one may need to peek ahead more than one character: In Pascal, when you have read 3 and the next character is a dot: do you proceed in hopes of getting 3.14, or do you stop for fear of getting 3..5 (.. can be a token!) Fortran has even messier cases (e.g., the scanner may need to unread buffered characters): DO 5 I = 1, 25 (a loop) vs. DO 5 I = 1.25 (an assignment) - cf. NASA's Mariner 1! 5
Token Attributes Attribute of a token: additional information on the specific lexeme For simplicity, a token may have a single attribute which holds the required information for that token. For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token. Attributes for some tokens: <id,attr>: attr is a pointer to the symbol table <assign-op,_>: no attribute is needed (if there is only one type of assignment operator) <num,val>: val is the actual value of the number. 6
Pragmas Pragmas: constructs that provide directives or hints to the compiler Pragmas do not change program semantics, only the compilation process ( significant comments ) Turn various run-time checks on and off Turn certain code improvements on and off Enable or disable performance profiling (stats, etc.) The Scanner usually deals with pragmas in languages where they can appear anywhere in the source. Examples of pragmas as hints for the compiler: Variable x is very heavily used Keep it in a register! Subroutine F is a pure function Its only effect is returning a value Subroutine S is not (indirectly) recursive Its storage can be statically allocated 32 bits of precision (instead of 64) suffice for floating-point variable x. The compiler may ignore these: In the interest of simplicity or In the face of contradictory information 7
Calculator Language: Tokens := is used for assignment Tokens read and write are listed as exceptions to the rule for id: in effect, they are treated as keywords Two styles of comments (as in C) are allowed (no nesting of comments of the same type, but the two styles can appear inside each other, to allow commenting out ) 8
Scanner Implementation Ad hoc approach (hand-coded): production compilers often use ad-hoc scanners, as these generally yield the fastest, most compact code by doing lots of special-purpose things Semi-mechanical pure DFA (direct-coded) Table-driven DFA DFA-based implementations are preferable during development, as they allow the scanner to be built in a more structured way 9
Ad Hoc Scanner Simpler and more common cases are checked first Read characters one at a time, with look-ahead ( peek ) when needed Embed loops for comments and for long tokens When invoked again, the scanner repeats from the beginning, using the next available characters, including those peeked at but not consumed Lexical errors?! 10
DFA-based Scanner Implementation Write the language's lexical specification and convert it into REs Convert the REs into a nondeterministic FA (NFA) Translate the NFA into an equivalent DFA Optimise (minimise) the DFA Implement the DFA either through the direct-coded approach or using table-driven scanning Typical Scanner Generator 11
Recognising Multiple Kinds of Tokens A scanner differs from a plain formal DFA in that it identifies tokens in addition to recognising them, i.e., it indicates which token was found. In practice, this means it must keep separate final states for every kind of token To build a scanner for a language with n different kinds of tokens: Begin from NFAs {M i, i=1,n}, one per token kind Create a new start state with ε transitions to the start state of each M i In contrast to the normal alternation construction, do not create a single final state: keep the existing ones, each labeled by the token for which it is final Then apply the NFA-to-DFA construction as before. In the DFA minimisation phase, instead of starting with two equivalence classes (final and non-final states), begin with n+1, including a separate class of final states for each kind of token. 12
Scanner for Calculator: DFA The FA starts in a distinguished initial state When it reaches one of a designated set of final states, it recognises the token associated with that state Comments, when recognised, send the scanner back to its start state The longest possible token rule means: the scanner returns to the parser only when the next character cannot be used to continue the current token. 13
Scanner Code: Pure DFA This direct-coded, hand-written approach embeds the automaton in the control flow of the program using nested case (switch) statements The outer case statement covers the states of the FA; the inner cases cover the transitions out of each state Most of the inner clauses set a new state Some return from the scanner with the current token (if the current character should not be part of that token, it is pushed back onto the input stream) Easier to write and to debug than the ad hoc approach, if not quite as efficient. 14
Scanner Tables For Calculator 15
Scanner Tables and Driver Scanner tables generated for the calculator language: States are numbered as in the calculator DFA graph, with the addition of states 17 and 18 to recognise white space and comments Three main tables: scan_tab: each entry specifies an action: move to a new state (and if so, which), return a token, or announce an error token_tab: indicates for each state whether we might be at the end of a token (and if so, which one) Separating this table from the main one allows us to notice when we pass a state that might have been the end of a token, so we can back up if we hit an error state. keyword_tab: contains read and write. Driver for a table-driven scanner (declarations) The scanner must return: The kind of token found Its character-string image (spelling), needed for semantic analysis and error messages 16
Driver for Table-Driven Scanner Driver program (generic skeleton scanner ): Uses the current state and input character to index into scan_tab. Before returning: looks tokens up in keyword_tab The outer loop serves to filter out comments and white space (spaces, tabs, newlines) Lexical errors: the next character of input may be neither an acceptable continuation of the current token nor the start of another token The scanner must print a message and perform some sort of recovery: Throw away the current, invalid token Skip forward until the next proper character is found Restart the scanning algorithm Or count on the error-recovery mechanism of the parser 17
Lex: Scanner Generator Lex: a Unix tool for automatically generating a scanner from a lex specification (.l file) A Lex source file is essentially a table of REs and corresponding program fragments; the generated scanner implements a DFA Lex reads the .l file and generates a C program with a function yylex() to be called by the parser (usually yacc) There are free open-source analogs of lex, notably flex (typically used with the Bison parser) A lot of tutorials are available online 18
Lex Specification Lex file structure: Definition section: defines macros and imports header files written in C. It is also possible to write any C code here, which will be copied verbatim into the generated source file. Rules section: associates RE patterns with C statements. When the lexer sees input text matching a given pattern, it executes the associated C code. C code section: contains arbitrary user C code (e.g. a main function and helper routines), copied verbatim to the end of the generated source file. REs in Lex: http://dinosaur.compilertools.net/lex/index.html 19
Lex: Example Example: generate a scanner that recognizes strings of numbers (integers) in the input and simply prints them out. If this specification is given to flex, it is converted into lex.yy.c, which is compiled into an executable that matches and outputs the strings of integers found in its input. http://en.wikipedia.org/wiki/lex_(software) 20
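A specification in the spirit of the classic example from the linked page looks roughly like this (treat the exact message string and %option line as illustrative; flex's %option noyywrap avoids needing to define yywrap):

```lex
/* Sketch of a flex specification: print every run of digits. */
%option noyywrap
%{
#include <stdio.h>
%}
%%
[0-9]+   printf("Saw an integer: %s\n", yytext);
.|\n     ;   /* ignore every other character */
%%
int main(void) {
    yylex();
    return 0;
}
```

Running flex on this file produces lex.yy.c, which is then compiled with a C compiler to obtain the scanner.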
Exercises Build an ad-hoc scanner (e.g., in C) for the calculator language As output, have it print a list, in order, of the input tokens. For simplicity, feel free simply to halt in the event of a lexical error. Try the Lex or Flex tools on the calculator language. Compare your hand-written C program with the generated scanner in C. 21
Check Your Understanding List the tasks performed by a typical scanner What are pragmas? Explain the reasons behind the longest possible token rule. Why must the scanner save the text of tokens? Why must it sometimes peek at upcoming characters? Explain the main approaches to scanner implementation. What are the advantages of an automatically generated scanner in comparison to a handwritten one? Why do many commercial compilers use a handwritten scanner anyway? Describe the process of building a scanner using the Lex tool. 22