Compilers overview

There are many aspects to be considered in the study of compilers, and the study usually encompasses more than just the strict definition of a compiler. In general, a compiler is a program that translates source code into object code. A compiler "takes as input the specification for an executable program and produces as output the specification for another, equivalent executable program." A compiler is usually a fairly large program composed of several components. It must be planned, designed, implemented, tested, and delivered according to the best software engineering practices.

Typical tradeoffs in compiler design are:
- speed of compilation
- size of the generated code
- speed of execution of the generated code

Foundations
Automata theory plays a crucial role in lexical analysis: all scanner generators are based upon results from automata theory. Properties of languages, such as those of context-free languages, are exploited in parsing theory.

Design
Compilers must be designed. They require explicit interfaces between the various components. Since we will be using Java, we should expect to use abstraction, encapsulation, and polymorphism along with appropriate design patterns.

Computer architecture
A compiler that produces some form of executable code must take into account the underlying machine architecture, even if that architecture is a virtual machine like the Java Virtual Machine (JVM). Compilers are expected to generate efficient code, and code-generation patterns used for one type of architecture (e.g., RISC) might be counterproductive and inappropriate for another (e.g., CISC).

Compiler vs. Interpreter
An interpreter translates some form of source code into a target representation that it can immediately execute and evaluate. The structure of an interpreter is similar to that of a compiler, but the amount of time it takes to produce the executable representation will vary, as will the amount of optimization. The following summarizes the differences.

Compiler characteristics:
- spends a lot of time analyzing and processing the program
- the resulting executable is some form of machine-specific binary code
- the computer hardware interprets (executes) the resulting code
- program execution is fast

Interpreter characteristics:
- relatively little time is spent analyzing and processing the program
- the resulting code is some sort of intermediate code
- the resulting code is interpreted by another program
- program execution is relatively slow

Structure of a compiler

You start with one representation of the program (using a very broad interpretation of what we mean by "program"). That representation, the source, is analyzed for structural correctness by the lexical analyzer (scanner) and the parser. The parser cooperates with the semantic analyzer to ensure that the program is not only structurally correct but also meaningful in terms of the source language's semantics. These actions are typically referred to as the compiler's front end.

One of the challenges of compiler design is the communication mechanism for each interface between the different parts of a compiler. Along with communication issues, you need to decide whether each module processes a complete compilation unit before passing it on to the next module, or whether all modules work cooperatively on small parts of the input at any given time. This is called the compiler's bandwidth.

Front End Structure

A typical front end consists of three parts:
- The lexical analyzer, or scanner, converts the source code into a stream of tokens. Each token represents the occurrence of a single lexical element, called a lexeme, in the source program.
- The parser analyzes the token stream to ensure that it is an instance of a string that is in the source language.
- The context (semantic) analyzer ensures that the input is meaningful. For example, it ensures that all variables that are referenced have been declared (if that is a requirement of the language).

Middle Structure

Most modern compilers have an optimization phase that sits between the front end and the back end. Usually the optimizer consists of several cooperating modules that take as input some intermediate form of the program under compilation and produce a transformation of it that is (hopefully) optimized. The optimization phases of most compilers today are composed of several modules because such an optimizer is much easier to design and maintain than a single, monolithic module that tries to do all optimizations.

Back End Structure

The back end of the compiler is responsible for emitting the final (executable) version of the source program. Typical parts of the back end are responsible for:
- instruction selection
- register allocation
- memory management
- instruction scheduling

Common Infrastructure

A compiler will have certain modules that collaborate with modules in the front, middle, and back of the compiler. These typically consist of:
- symbol tables
- grammars
- trees, graphs, and other data structures

Lexical Analysis

1. Overview

Scanning is also called lexical analysis, because the scanner analyzes the input stream, which consists of lexical elements called lexemes. The term lexeme comes from the field of linguistics, where it means the smallest unit of a language that has meaning. Scanning is perhaps the part of compilation that rests most directly upon formal mathematics and the foundations of computer science.

1.1 First program transformation

The scanner converts the source program's stream of lexemes into an equivalent representation as tokens. A token represents the lexeme and the context of the lexeme. For example, a typical token contains:
- the type of lexeme (e.g., IDENTIFIER)
- the lexeme's image (e.g., "aVariable")
- the lexeme's position in the input stream

The next part of the compiler, the parser, works on this stream of tokens. The main task of the lexical analyzer is to read a stream of characters as input and produce a sequence of tokens, such as names, keywords, and punctuation marks, for the syntax analyzer. It discards the white space and comments between the tokens and also keeps track of line numbers. The general tasks and topics involved in lexical analysis are:
- tokens, patterns, and lexemes
- specification of tokens
  o regular expressions
  o notational shorthand
- finite automata
  o nondeterministic finite automata (NFA)
  o deterministic finite automata (DFA)
  o conversion of an NFA into a DFA
  o construction of an NFA from a regular expression

Terminology

1. Token: a lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. Examples of tokens: type tokens (id, num, real, ...), punctuation tokens (IF, void, return, ...), alphabetic tokens (keywords). Examples of non-tokens: comments, preprocessor directives, macros, blanks, tabs, newlines.

2. Pattern: there is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule, called a pattern, associated with the token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the Pascal identifier token, id, is: id -> letter (letter | digit)*.

3. Lexeme: a lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
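To make the token record concrete, here is a minimal sketch in C of what such a structure might look like. The type and field names are illustrative only, not taken from any particular compiler:

    /* A minimal token record, as described above: the kind of
       lexeme, its image (the matched text), and its position in
       the input stream. Names and sizes are illustrative. */
    enum TokenKind { TK_IDENTIFIER, TK_NUMBER, TK_RELOP, TK_KEYWORD, TK_EOF };

    struct Token {
        enum TokenKind kind;   /* the type of lexeme, e.g. TK_IDENTIFIER */
        char image[64];        /* the lexeme's image, e.g. "aVariable" */
        int line;              /* position of the lexeme in the input */
        int column;
    };

The parser then works entirely in terms of such records, never looking back at the raw character stream.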

For example, the pattern for the RELOP token covers six lexemes (=, <>, <, <=, >, >=), so the lexical analyzer should return a RELOP token to the parser whenever it sees any one of the six.

1.2 Clean up the input

Scanners normalize the input stream. There is often unwanted text, such as comments, white space, and other lexemes that contribute nothing to the program translation. The scanner removes it, or puts it in some special form, so that the parser does not have to attend to it.

1.3 Lexeme interpretation

One of the interesting decisions a compiler writer must make is what type of analysis and synthesis operations should go in which compiler component. For example, consider a fragment of source text containing a signed real number. How many lexemes (and corresponding tokens) are there? Various answers, all correct, are 1, 2, and 4. Let's look a little further into this. If your scanner takes into account the structure of real and signed numbers, there is a single token that represents the real number, sign and all. Perhaps your scanner separates the sign from the number; then you have two tokens, the first representing the minus sign and the second representing the positive real number. Maybe your scanner takes even smaller bites of the input; then you would have four lexemes (for instance, the sign, the integer part, the decimal point, and the fractional part).

The last approach makes for a simpler scanner. It does, however, put more of a burden upon the parser. You need to decide which is more appropriate. My preference is to use the last approach. This has a couple of benefits:
- The scanner is not responsible for determining the appropriate format for numbers.
- The scanner does not have to determine whether a piece of text is an arithmetic expression or two numbers.

2. Theoretical foundations

There are two theoretical building blocks that have helped make scanner construction efficient and automatic: finite automata and regular expressions.

2.1 Regular expressions

Regular expressions give us a way to formally express the lexical structure of a language. If a and b are regular expressions, then:
- (a) is a regular expression.
- a | b is a regular expression.
- ab is a regular expression.
- a* is a regular expression.

2.2 Finite automata

Finite automata are a way of representing the structure of regular languages, and hence regular expressions. There are two types of finite automata: deterministic (DFA) and nondeterministic (NFA). Both are used in the construction of many types of scanners. Even when scanners are written by hand, the state-based approach and the results from finite automata are used.

3. Three important results

When building a scanner, you will deal with tables that represent states of finite automata and with operations that combine regular expressions:
- Kleene's construction, which takes a DFA to a regular expression. This is important for ensuring that the DFA you use actually represents the language you think it does.
- Thompson's construction, which creates an NFA from a regular expression. This is the first step in converting a regular-expression specification to an executable scanner.
- Subset construction, which transforms an NFA into a corresponding DFA. The DFA is the final form we want for implementing the scanner, since our programming languages are (usually) deterministic.

There is a fourth result that you use to minimize the DFA produced by the subset construction: Hopcroft's algorithm.
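As an illustration of the state-based approach in a hand-written scanner, here is a small C sketch (not from the original notes) of a DFA fragment that recognizes the six RELOP lexemes; getc/ungetc stand in for whatever input interface the real scanner uses, and the one-character retraction is the usual lookahead handling:

    #include <stdio.h>

    enum Relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

    /* Hand-coded DFA fragment for ( =, <>, <, <=, >, >= ).
       Each state inspects one character and either moves to
       another state or accepts, pushing back any character read
       beyond the lexeme. A real scanner would share its input
       buffer with the rest of the lexer. */
    enum Relop scan_relop(FILE *in)
    {
        int c = getc(in);
        switch (c) {
        case '=':
            return EQ;
        case '<':
            c = getc(in);
            if (c == '=') return LE;
            if (c == '>') return NE;
            ungetc(c, in);          /* retract the lookahead */
            return LT;
        case '>':
            c = getc(in);
            if (c == '=') return GE;
            ungetc(c, in);
            return GT;
        default:
            ungetc(c, in);
            return NOT_RELOP;
        }
    }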

LEX

LEX is a program that generates lexical analyzers ("scanners" or "lexers"). Lex is commonly used with the yacc parser generator. Lex, originally written by Mike Lesk and Eric Schmidt, is the standard lexical analyzer generator on many Unix systems, and a tool exhibiting its behavior is specified as part of the POSIX standard. Lex reads an input stream specifying the lexical analyzer and outputs source code implementing the lexer in the C programming language. Though traditionally proprietary software, versions of Lex based on the original AT&T code are available as open source, as part of systems such as OpenSolaris and Plan 9 from Bell Labs. Another popular open-source version of Lex is Flex, the "fast lexical analyzer".

During the first phase the compiler reads the input and converts strings in the source to tokens. With regular expressions we can specify patterns to lex so that it can generate code to scan and match strings in the input. Each pattern specified in the input to lex has an associated action. Typically an action returns a token that represents the matched string for subsequent use by the parser. Initially we will simply print the matched string rather than return a token value.

The following simple pattern, composed of a regular expression that scans for identifiers, is one Lex can read to produce C code for a lexical analyzer:

    letter (letter | digit)*

This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits. The example nicely illustrates the operations allowed in regular expressions:
- repetition, expressed by the * operator
- alternation, expressed by the | operator
- concatenation, expressed by writing expressions next to one another

Any regular expression may be expressed as a finite state automaton (FSA). We can represent an FSA using states and transitions between states. There is one start state and one or more final (accepting) states.

Token Recognition by LEX

Having described a way to characterize the patterns associated with tokens, we begin to consider how to recognize tokens, i.e., recognize instances of patterns, i.e., recognize the strings of a regular language. We'll use Lex: it generates an efficient scanner automatically, based on regular expressions. We need to specify patterns for the tokens if, then, else, relop, id, and num. We can use the regular definitions:

    if     -> if
    then   -> then
    else   -> else
    relop  -> < | <= | = | <> | > | >=
    digit  -> [0-9]
    letter -> [A-Za-z]
    id     -> letter ( letter | digit )*
    num    -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

We'll assume in addition that keywords are reserved. So although the string if, for instance, belongs to the language denoted by id as well as the language denoted by if, our lexical analyzer should return the token if when given the lexeme if. We will also assume that lexemes may be separated by white space: a nonempty string of blanks, tabs, and newlines.

Our scanner will strip out white space, using the regular definitions below:

    delim -> blank | tab | newline
    ws    -> delim+

If a match for ws is found, no token is returned; instead we return the token after the ws. We can do this according to the following table:

    Regular expression   Token   Attribute value
    ws                   (none)  (none)
    if                   if      (none)
    then                 then    (none)
    else                 else    (none)
    id                   id      lexeme
    num                  num     lexeme
    <                    relop   LT
    <=                   relop   LE
    =                    relop   EQ
    <>                   relop   NE
    >                    relop   GT
    >=                   relop   GE

As before, in practice we will return the token and place the attribute value in a global variable. We'll build our Lex scanner in accordance with this table. (For instance, we won't directly define a pattern for relop.)

Lex specifications

A Lex program consists of three (four) parts:

    %{
    C declarations
    %}
    regular definitions
    %%
    translation rules
    %%
    C functions, incl. yywrap()

Anything included between the brace markers %{ and %} is copied verbatim from the lex file to lex.yy.c. The Lex regular definitions are similar to the regular definitions we have studied already. (We'll look more closely at their syntax in a moment.) The translation rules are statements of the form

    p1    action1
    p2    action2
    ...
    pn    actionn

where each pi is a regular expression and each actioni is a C program fragment. When yylex() is called, it finds the longest prefix of the input that matches one of the regular expressions pi, places the lexeme in yytext, and executes the corresponding action. (If two expressions match the longest lexeme, the first one is preferred.) Typically the action ends by returning the appropriate token. But if the action does not end with a return of control, the scanner proceeds to find the next lexeme and execute the corresponding action. Lex allows regular definitions in addition to regular expressions. Here is a fragment of our first Lex example, starting with the regular definitions section:

Simple LEX Program

    %{
    /* (no C declarations needed) */
    %}
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
    %%
    {ws}     { /* no action and no return */ }
    if       { return(IF); }
    then     { return(THEN); }
    else     { return(ELSE); }
    {id}     { yylval = install_id(); return(ID); }
    {number} { yylval = install_num(); return(NUMBER); }
    %%

%% is used to separate the different sections.

Lexer Implementation Options

Hand-written lexer:
- implement a finite state automaton
- start in some initial state
- look at each input character in sequence, updating the lexer state accordingly
- if the state at end of input is an accepting state, the input string matches the RE

Lexer generator:
- generates the tokenizer automatically (e.g., flex, jlex)
- uses the RE-to-NFA-to-DFA algorithm
- generates a table-driven lexer (also an FSA); see the sketch after the comparison below

Lexer Generation Steps

Input:
- a list of regular expressions describing the tokens in the language, in priority order
- an associated action for each RE (generates the appropriate kind of token, other bookkeeping)

Process:
- reads the patterns
- builds a finite automaton to accept valid tokens

Output:
- an implementation of the FA that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: "unexpected character")
- Lex (flex) generates C code; JLex generates Java
- compile and link the C or Java code, and you've got a scanner

Comparison of Both Methods

Hand-coded scanner: the programmer creates types, defines data and procedures, designs the flow of control, and implements it in the source language.

Lex-generated scanner: the programmer writes patterns (declarative, not procedural) and Lex/flex implements the flow of control. Much less hand-coding, but the generated code looks pretty alien and can be tricky to debug.
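Here is the table-driven sketch promised above, in C. It shows the shape of what a generator emits for the identifier pattern letter(letter|digit)*; the states, character classes, and tables here are invented for illustration, and flex's real tables are much larger but have the same structure:

    #include <ctype.h>

    /* Table-driven FSA for id = letter (letter | digit)*.
       States: 0 = start, 1 = in identifier, -1 = dead.
       State 1 is the only accepting state. */
    enum { CL_LETTER, CL_DIGIT, CL_OTHER, NCLASSES };

    static int char_class(int c)
    {
        if (isalpha(c)) return CL_LETTER;
        if (isdigit(c)) return CL_DIGIT;
        return CL_OTHER;
    }

    static const int next_state[2][NCLASSES] = {
        /* state 0 */ { 1, -1, -1 },   /* a letter starts an id */
        /* state 1 */ { 1,  1, -1 },   /* letters/digits continue it */
    };

    /* Returns 1 if s is a valid identifier, 0 otherwise. */
    int is_identifier(const char *s)
    {
        int state = 0;
        for (; *s; s++) {
            state = next_state[state][char_class((unsigned char)*s)];
            if (state < 0) return 0;   /* dead state: reject */
        }
        return state == 1;             /* accept only in state 1 */
    }

The driver loop is the same no matter how complicated the regular expressions are; only the tables change, which is why generated scanners are both fast and easy to regenerate.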

Scanner in LEX (a LEX program that recognizes the relational operators of C, along with keywords, identifiers, and numbers)

    %{
    /* C declarations: manifest constants for the tokens and
       attribute values. The numeric values did not survive in
       this copy of the notes; any distinct values will do. */
    #define LT     256
    #define LE     257
    #define EQ     258
    #define NE     259
    #define GT     260
    #define GE     261
    #define RELOP  262
    #define ID     263
    #define NUM    264
    #define IF     265
    #define THEN   266
    #define ELSE   267
    int attribute;
    %}
    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    num     {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
    %%
    {ws}    { /* skip white space */ }
    if      { return(IF); }
    then    { return(THEN); }
    else    { return(ELSE); }
    {id}    { return(ID); }
    {num}   { return(NUM); }
    "<"     { attribute = LT; return(RELOP); }
    "<="    { attribute = LE; return(RELOP); }
    "<>"    { attribute = NE; return(RELOP); }
    "="     { attribute = EQ; return(RELOP); }
    ">"     { attribute = GT; return(RELOP); }
    ">="    { attribute = GE; return(RELOP); }
    %%
    /* Auxiliary procedures: install_id(), install_num() */

Some Lexical Errors
- illegal characters
- non-terminated comments
- ill-formed constants

Error handling in lexical analysis: to handle an error we can simply discard the offending input, but if the error occurs in the middle of a lexeme we cannot throw away the entire program; instead we try to correct the input (for example, by inserting, deleting, or replacing a character) so that scanning can continue with legal syntax.
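The auxiliary procedures install_id() and install_num() are only named above. A plausible sketch (assumed here, not the notes' actual code) has them enter the matched lexeme from yytext into a symbol table and return an index that the scanner places in yylval:

    #include <stdlib.h>
    #include <string.h>

    /* Sketch of the auxiliary procedures referenced above.
       yytext is provided by the Lex-generated scanner; the table
       and install() are assumed helpers for illustration. */
    extern char *yytext;          /* the matched lexeme */

    #define NSYMS 1024
    static char *symtable[NSYMS];
    static int   nsyms;

    static int install(const char *lexeme)
    {
        int i;
        for (i = 0; i < nsyms; i++)          /* already present? */
            if (strcmp(symtable[i], lexeme) == 0)
                return i;
        symtable[nsyms] = strdup(lexeme);    /* enter new lexeme */
        return nsyms++;
    }

    int install_id(void)  { return install(yytext); }
    int install_num(void) { return install(yytext); }

The parser can later map the index in yylval back to the lexeme (or to a richer symbol-table entry) whenever it needs the attribute value.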

Input Buffering

Some efficiency issues are concerned with the buffering of input. A two-buffer input scheme is useful when lookahead on the input is necessary to identify tokens, and there are techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end.

There are three general approaches to the implementation of a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

Buffer pairs: because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character. The scheme to be discussed:
- consists of a buffer divided into two N-character halves, where N is the number of characters on one disk block (e.g., 1024);
- reads N characters into each half of the buffer with one system read command;
- if fewer than N characters remain in the input, reads a special eof character into the buffer after the input characters.

Two pointers into the buffer are maintained, and the string of characters between the two pointers is the current lexeme. Initially both pointers point to the first character of the next lexeme to be found. The forward pointer scans ahead until a match for a pattern is found; once the next lexeme is determined, the forward pointer is set to the character at its right end. If the forward pointer is about to move past the halfway mark, the right half is filled with N new input characters. If the forward pointer is about to move past the right end of the buffer, the left half is filled with N new characters and the forward pointer wraps around to the beginning of the buffer.

Disadvantage of this scheme: it works well most of the time, but the amount of lookahead is limited, and this may make it impossible to recognize tokens in situations where the distance the forward pointer must travel is more than the length of the buffer. For example, given DECLARE(ARG1, ARG2, ..., ARGn) in a PL/1 program, we cannot determine whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis.

Sentinels: in the previous scheme, each time we move the forward pointer we must check that we have not moved off one half of the buffer; if we have, the other half must be reloaded. Thus the ends of the buffer halves require two tests for each advance of the forward pointer. We can reduce the two tests to one if we extend each buffer half to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program (the eof character is used as the sentinel).
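The following C sketch illustrates the buffer-pair scheme with sentinels just described. The buffer layout and names are assumptions made for illustration, and '\0' stands in for the eof sentinel (a character assumed not to occur in source programs):

    #include <stdio.h>

    #define N        1024    /* characters per buffer half */
    #define EOF_CHAR '\0'    /* sentinel: cannot appear in the source */

    static char  buf[2 * N + 2];  /* two halves, each ending in a sentinel */
    static char *forward;
    static FILE *src;

    /* Refill one half with up to N characters, then plant the sentinel.
       A short read means the real end of input also looks like eof. */
    static void fill(char *half)
    {
        size_t n = fread(half, 1, N, src);
        half[n] = EOF_CHAR;
    }

    static void init(FILE *f) { src = f; fill(buf); forward = buf; }

    /* Advance the forward pointer: normally a single test per character. */
    static int next_char(void)
    {
        char c = *forward++;
        if (c != EOF_CHAR)
            return c;                       /* the common, one-test path */
        if (forward == buf + N + 1) {       /* end of first half reached */
            fill(buf + N + 1);
            return next_char();
        }
        if (forward == buf + 2 * N + 2) {   /* end of second half reached */
            fill(buf);
            forward = buf;                  /* wrap around */
            return next_char();
        }
        return EOF;                         /* genuine end of input */
    }

Note how the extra boundary tests run only when the sentinel is actually seen, which is once per N characters.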

With sentinels, most of the time the scanner performs only one test, to see whether forward points to an eof. Only when it reaches the end of a buffer half or the real end of file does it perform more tests. Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.

Brute-Force Approach

A top-down parse moves from the goal symbol to a string of terminal symbols. In the terminology of trees, this is moving from the root of the tree to a set of the leaves in the syntax tree for a program. In using full backup we are willing to attempt to create a syntax tree by following branches until the correct set of terminals is reached. In the worst possible case, that of trying to parse a string which is not in the language, all possible combinations are attempted before the failure to parse is recognized. Top-down parsing with full backup is a "brute-force" method of parsing. In general terms, this method operates as follows:
1. Given a particular nonterminal that is to be expanded, the first production for this nonterminal is applied.
2. Then, within this newly expanded string, the next (leftmost) nonterminal is selected for expansion and its first production is applied.
3. This process (step 2) of applying productions is repeated for all subsequent nonterminals that are selected until such time as the process cannot or should not be continued.

This termination (if it ever occurs) may be due to two causes. First, no more nonterminals may be present, in which case the string has been successfully parsed. Second, it may result from an incorrect expansion, which would be indicated by the production of a substring of terminals that does not match the appropriate segment of the source string. In the case of such an incorrect expansion, the process is "backed up" by undoing the most recently applied production. Instead of using the particular expansion that caused the error, the next production of this nonterminal is used as the next expansion, and then the process of production application continues as before. If, on the other hand, no further productions are available to replace the production that caused the error, this error-causing expansion is replaced by the nonterminal itself, and the process is backed up again to undo the next most recently applied production. This backing up continues either until we are able to resume normal application of productions to selected nonterminals or until we have backed up to the goal symbol and there are no further productions to be tried. In the latter case, the given string must be unparsable because it is not part of the language determined by this particular grammar. As an example of this brute-force parsing technique, let us consider a simple grammar

where S is the goal or start symbol. Figure 6-1 illustrates the working of this brute-force parsing technique by showing the sequence of syntax trees generated during the parse of the string accd. Initially we start with the tree of Fig. 6-1a, which merely contains the goal symbol. We next select the first production for S, thus yielding Fig. 6-1b. At this point we have matched the symbol a in the string to be parsed. We now choose the first production for A and obtain Fig. 6-1c. Note, however, that we have a mismatch between the second symbol c of the input string and the second symbol b in the sentential form abd. At this point in the parse we must back up: the previous production application for A must be deleted and replaced with its next choice. The result of performing this operation is to transform Fig. 6-1c into Fig. 6-1d, with the leftmost two characters of the given input string being matched. When the third symbol c of the input string is compared with the last symbol d of the current sentential form, however, a mismatch again occurs. The previously chosen production for A must be deleted. Since there are no more rules for A which can be selected, we must also delete the production for S and select the next production for S. This sequence of operations yields Fig. 6-1e. The final step involves applying the first rule for B to Fig. 6-1e, which yields Fig. 6-1f.

Figure 6-1: Trace of a brute-force top-down parse for the string accd. (Note: the marker in the figure denotes the extent of the scanning process from left to right in the input string.)

The remaining input symbols are then matched with the remaining symbols in the sentential form of Fig. 6-1f, thereby resulting in a successful parse.
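The grammar itself did not survive in this copy of the notes; the trace above is consistent with the classic example S -> aAd | aB, A -> b | c, B -> ccd | ddc, which the following C sketch assumes. It tries each production in order against the input and backs up on a mismatch, exactly as described:

    #include <stdio.h>
    #include <string.h>

    /* Brute-force top-down parsing with full backup, for the
       assumed grammar:
           S -> a A d | a B
           A -> b | c
           B -> ccd | ddc
       Each parse_X(s) tries the productions for X in order at
       position s; "backing up" happens implicitly when a failed
       alternative returns -1 and the caller tries the next one.
       The return value is the number of characters matched, or -1. */
    static int parse_A(const char *s)
    {
        if (s[0] == 'b') return 1;                 /* A -> b */
        if (s[0] == 'c') return 1;                 /* A -> c */
        return -1;
    }

    static int parse_B(const char *s)
    {
        if (strncmp(s, "ccd", 3) == 0) return 3;   /* B -> ccd */
        if (strncmp(s, "ddc", 3) == 0) return 3;   /* B -> ddc */
        return -1;
    }

    static int parse_S(const char *s)
    {
        int n;
        if (s[0] == 'a') {
            n = parse_A(s + 1);                    /* try S -> a A d */
            if (n >= 0 && s[1 + n] == 'd') return 1 + n + 1;
            n = parse_B(s + 1);                    /* back up: S -> a B */
            if (n >= 0) return 1 + n;
        }
        return -1;
    }

    int main(void)
    {
        const char *input = "accd";
        if (parse_S(input) == (int)strlen(input))
            printf("parsed %s successfully\n", input);
        else
            printf("cannot parse %s\n", input);
        return 0;
    }

Running it on "accd" reproduces the trace: A -> b fails, A -> c matches but the following d does not, so the parser abandons S -> aAd and succeeds with S -> aB, B -> ccd.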

Peephole Optimization

If code is generated statement by statement, it contains several redundancies and suboptimal constructs. In order to remove such redundancies and increase efficiency we need to optimize the code. Peephole optimization is a technique by which we can optimize code locally: we examine a short sequence of instructions (the peephole) and replace it with a shorter or faster sequence whenever possible.

Kinds of peephole optimization:
1. elimination of redundant loads and stores
2. elimination of multiple jumps
3. elimination of unreachable code
4. algebraic simplifications
5. strength reduction
6. use of machine idioms

1. Eliminating redundant loads and stores. Consider

    MOV R0, a
    MOV a, R0

We can eliminate the second instruction, since the value of a is already in R0, provided the second instruction does not carry a label (otherwise there is no guarantee that the first instruction is always executed immediately before it). A code sketch of this transformation follows at the end of this section.

2. Eliminating multiple jumps: unnecessary jumps to jumps are removed to increase efficiency and speed of execution. For example,

    if a < b goto L1        becomes        if a < b goto L2
    ...                                    ...
    L1: goto L2                            L2: a = a + 3
    L2: a = a + 3

3. Eliminating unreachable code: unreachable code is code that can never be executed on any path through the program. An unlabeled instruction that immediately follows an unconditional jump can be removed. Unreachable code is sometimes retained only for debugging purposes.

4. Algebraic simplifications: useless algebraic computations waste time and space in the program. Such instructions should be simplified or eliminated to increase execution speed, e.g., A = A + 0 and A = A * 1.

5. Strength reduction: to increase the efficiency of the code we can replace costlier instructions with cheaper ones; e.g., X = Y ** 2 can be replaced with X = Y * Y.

6. Using machine idioms: some machines have specific hardware instructions that implement certain operations efficiently. For example, auto-increment and auto-decrement addressing modes can implement a = a + 1 and a = a - 1 directly. These machine idioms help save both time and space.

Loop Optimization

Loop optimization is a machine-independent optimization. In any program there are always possibilities for improvement in the inner loops, and loop optimization exploits them. There are two basic techniques for loop optimization:
1. eliminating induction variables
2. eliminating loop-invariant computations
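Before turning to the details of these two techniques, here is the promised toy C sketch of the first peephole transformation, redundant load/store elimination. The instruction representation and the fixed operand names are invented for illustration only:

    #include <string.h>

    /* Toy peephole pass: slide a two-instruction window over the
       code and delete "MOV a,R0" when it immediately follows
       "MOV R0,a" and carries no label (a labeled instruction may
       be a jump target, so the pair is not guaranteed adjacent at
       run time). */
    struct Instr {
        char text[32];     /* e.g. "MOV R0,a" */
        int  has_label;
    };

    /* Compacts code in place; returns the new instruction count. */
    int eliminate_redundant_stores(struct Instr *code, int n)
    {
        int i, out = 0;
        for (i = 0; i < n; i++) {
            if (out > 0 && !code[i].has_label &&
                strcmp(code[out - 1].text, "MOV R0,a") == 0 &&
                strcmp(code[i].text,       "MOV a,R0") == 0)
                continue;              /* drop the redundant load */
            code[out++] = code[i];
        }
        return out;
    }

A production peephole pass would of course parse the operands rather than match fixed strings, but the sliding-window structure is the same.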

Induction variables: these are variables used in a loop whose values change in lockstep with each other, so there is a possibility of optimizing the code. In

    J = J - 1
    t4 = 4 * J

the two statements are in lockstep: the value of J decreases by 1 each time around the loop, and t4 therefore decreases by 4. Such identifiers are induction variables. Removing the dependence on the first instruction, we can directly write t4 = t4 - 4.

Loop-invariant computations: a loop-invariant computation is one that produces the same value every time the loop is executed. The elimination process: we first identify the invariant computations, then move them outside the loop without changing the meaning of the actual program.

We need the following steps to optimize loops:
A. Detection of loops: a loop is a cycle in a flow graph that satisfies two properties:
   1. It must have a single entry node, or header, so that it is possible to move all loop-invariant computations to a single unique place, known as the preheader.
   2. It must be strongly connected, i.e., it must be possible to go from any node of the loop to any other node within the loop. At least one loop executes repeatedly.
B. If a loop exists and is not dependent on the control flow of the program, then control flow analysis is required.
C. For loop detection we use a graphical representation known as the program flow graph.
D. To obtain such a graph we partition the intermediate code into basic blocks.
E. For basic blocks we require leader statements.
F. To find the leader statements we check the following:
   1. The first statement is a leader.
   2. The target of a conditional or unconditional goto is a leader.
   3. A statement that immediately follows a conditional goto is a leader.
G. Basic-block transformations are then used to optimize the code.

YACC (Yet Another Compiler-Compiler)

Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an "input language" which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers. Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine. The input subroutine produced by Yacc calls a user-supplied routine to return the next basic input item. Thus, the user can specify his input in terms of individual input characters, or in terms of higher-level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification. Yacc is written in portable C.

Introduction

Yacc provides a general tool for imposing structure on the input to a computer program. The Yacc user prepares a specification of the input process; this includes rules describing the input structure, code to be invoked when these rules are recognized, and a low-level routine to do the basic input. Yacc then generates a function to control the input process. This function, called a parser, calls the user-supplied low-level input routine (the lexical analyzer) to pick up the basic items (called tokens) from the input stream. These tokens are organized according to the input structure rules, called grammar rules; when one of these rules has been recognized, the user code supplied for this rule, called an action, is invoked; actions have the ability to return values and make use of the values of other actions. The topics covered below are:
1. Yacc specifications
2. grammar rules and actions
3. lexical analysis
4. how the parser works
5. ambiguity and conflicts
6. a simple mechanism for handling operator precedences in arithmetic expressions
7. error detection and recovery
8. the Yacc environment

1: Basic Specifications

A full specification file looks like:

    declarations
    %%
    rules
    %%
    programs

The declaration section may be empty. Moreover, if the programs section is omitted, the second %% mark may be omitted also; thus, the smallest legal Yacc specification is

    %%
    rules

Blanks, tabs, and newlines are ignored except that they may not appear in names or multi-character reserved symbols. Comments may appear wherever a name is legal; they are enclosed in /* ... */, as in C and PL/I. The rules section is made up of one or more grammar rules. A grammar rule has the form:

    A : BODY ;

A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals. The colon and the semicolon are Yacc punctuation. Names may be of arbitrary length and may be made up of letters, dot ".", underscore "_", and non-initial digits. Upper- and lowercase letters are distinct. The names used in the body of a grammar rule may represent tokens or nonterminal symbols.

The end of the input to the parser is signaled by a special token, called the endmarker. If the tokens up to, but not including, the endmarker form a structure which matches the start symbol, the parser function returns to its caller after the endmarker is seen; it accepts the input. If the endmarker is seen in any other context, it is an error. It is the job of the user-supplied lexical analyzer to return the endmarker when appropriate; see section 3, below. Usually the endmarker represents some reasonably obvious I/O status, such as "end-of-file" or "end-of-record".

2: Grammar Rules and Actions

With each grammar rule, the user may associate actions to be performed each time the rule is recognized in the input process. These actions may return values and may obtain the values returned by previous actions. Moreover, the lexical analyzer can return values for tokens, if desired. To facilitate easy communication between the actions and the parser, the action statements are altered slightly: the dollar sign "$" is used as a signal to Yacc in this context.
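As an illustration of the three sections and of the $-notation in actions, here is a minimal, self-contained Yacc specification constructed for these notes (it is not from the original text): it parses one line of integers separated by '+' and prints their sum. yylex() is hand-written in the programs section so that no separate Lex input is needed:

    %{
    #include <stdio.h>
    #include <ctype.h>
    int yylex(void);
    int yyparse(void);
    void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
    %}
    %token NUMBER
    %%
    line  : expr '\n'         { printf("= %d\n", $1); }
          ;
    expr  : expr '+' NUMBER   { $$ = $1 + $3; }  /* action uses $$, $1, $3 */
          | NUMBER            { $$ = $1; }
          ;
    %%
    int yylex(void)
    {
        int c = getchar();
        while (c == ' ' || c == '\t')
            c = getchar();
        if (c == EOF)
            return 0;                 /* 0 is the endmarker token */
        if (isdigit(c)) {
            yylval = 0;               /* token value goes in yylval */
            while (isdigit(c)) { yylval = 10 * yylval + (c - '0'); c = getchar(); }
            ungetc(c, stdin);
            return NUMBER;
        }
        return c;                     /* single-character tokens, incl. '+' and '\n' */
    }
    int main(void) { return yyparse(); }

Running yacc on this file and compiling the resulting y.tab.c gives a program that, fed the line 1 + 2 + 3, prints = 6.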

Actions that do not terminate a rule are actually handled by Yacc by manufacturing a new nonterminal symbol name and a new rule matching this name to the empty string. In many applications, output is not done directly by the actions; rather, a data structure, such as a parse tree, is constructed in memory, and transformations are applied to it before output is generated. Parse trees are particularly easy to construct, given routines to build and maintain the tree structure desired. The Yacc parser uses only names beginning in "yy"; the user should avoid such names to prevent confusion.

3: Lexical Analysis

The user must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex. The function returns an integer, the token number, representing the kind of token read. If there is a value associated with that token, it should be assigned to the external variable yylval. The parser and the lexical analyzer must agree on these token numbers in order for communication between them to take place. The numbers may be chosen by Yacc, or chosen by the user. In either case, the "#define" mechanism of C is used to allow the lexical analyzer to return these numbers symbolically. The specifications for these lexical analyzers use regular expressions instead of grammar rules. Lex can easily be used to produce quite complicated lexical analyzers, but there remain some languages (such as FORTRAN) which do not fit any theoretical framework and whose lexical analyzers must be crafted by hand.

4: How the Parser Works

Yacc turns the specification file into a C program, which parses the input according to the specification given. The parser produced by Yacc consists of a finite state machine with a stack. The parser is also capable of reading and remembering the next input token (called the lookahead token). The current state is always the one on the top of the stack. The states of the finite state machine are given small integer labels; initially, the machine is in state 0, the stack contains only state 0, and no lookahead token has been read. The machine has only four actions available to it, called shift, reduce, accept, and error.

5: Ambiguity and Conflicts

A set of grammar rules is ambiguous if there is some input string that can be structured in two or more different ways. When the parser can do two legal things, a shift or a reduction, and has no way of deciding between them, this is called a shift/reduce conflict. It may also happen that the parser has a choice of two legal reductions; this is called a reduce/reduce conflict. Note that there are never any "shift/shift" conflicts. When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser. It does this by selecting one of the valid steps wherever it has a choice. A rule describing which choice to make in a given situation is called a disambiguating rule. Yacc invokes two disambiguating rules by default:
1. In a shift/reduce conflict, the default is to do the shift.
2. In a reduce/reduce conflict, the default is to reduce by the earlier grammar rule (in the input sequence).

Rule 1 implies that reductions are deferred in favor of shifts whenever there is a choice. Rule 2 gives the user rather crude control over the behavior of the parser in this situation, but reduce/reduce conflicts should be avoided whenever possible.
Conflicts may arise because of mistakes in input or logic, or because the grammar rules, while consistent, require a more complex parser than Yacc can construct. The use of actions within rules can also cause conflicts if the action must be done before the parser can be sure which rule is being recognized. In these cases, the application of disambiguating rules is inappropriate and leads to an incorrect parser. For this reason, Yacc always reports the number of shift/reduce and reduce/reduce conflicts resolved by Rule 1 and Rule 2.

6: Precedence

There is one common situation where the rules given above for resolving conflicts are not sufficient; this is in the parsing of arithmetic expressions. Most of the commonly used constructions for arithmetic expressions can be naturally described by the notion of precedence levels for operators, together with information about left or right associativity. It turns out that ambiguous grammars with appropriate disambiguating rules can be used to create parsers that are faster and easier to write than parsers constructed from unambiguous grammars. The user writes rules of the form expr : expr OP expr and expr : UNARY expr for all binary and unary operators desired. This creates a very ambiguous grammar with many parsing conflicts. As disambiguating rules, the user specifies the precedence, or binding strength, of all the operators, and the associativity of the binary operators. This information is sufficient to allow Yacc to resolve the parsing conflicts in accordance with these rules and construct a parser that realizes the desired precedences and associativities.

7: Error Handling

Error handling is an extremely difficult area, and many of the problems are semantic ones. When an error is found, for example, it may be necessary to reclaim parse tree storage, delete or alter symbol table entries, and, typically, set switches to avoid generating any further output. It is seldom acceptable to stop all processing when an error is found; it is more useful to continue scanning the input to find further syntax errors. This leads to the problem of getting the parser "restarted" after an error. A general class of algorithms to do this involves discarding a number of tokens from the input string and attempting to adjust the parser so that input can continue. To allow the user some control over this process, Yacc provides a simple, but reasonably general, feature. The token name "error" is reserved for error handling. This name can be used in grammar rules; in effect, it suggests places where errors are expected and recovery might take place. The parser pops its stack until it enters a state where the token "error" is legal. It then behaves as if the token "error" were the current lookahead token and performs the action encountered. The lookahead token is then reset to the token that caused the error. If no special error rules have been specified, processing halts when an error is detected. These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of the parser from many errors; moreover, the user can get control to deal with the error actions required by other portions of the program.

8: The Yacc Environment

When the user inputs a specification to Yacc, the output is a file of C programs, called y.tab.c on most systems (due to local file-system conventions, the name may differ from installation to installation). The function produced by Yacc is called yyparse; it is an integer-valued function. When it is called, it in turn repeatedly calls yylex, the lexical analyzer supplied by the user, to obtain input tokens. Eventually, either an error is detected, in which case (if no error recovery is possible) yyparse returns the value 1, or the lexical analyzer returns the endmarker token and the parser accepts. In this case, yyparse returns the value 0.

Symbol Table Management

A symbol table is a table that binds names to objects. We need a number of operations on symbol tables to accomplish this:
- We need an empty symbol table, in which no name is defined.
- We need to bind a name to an object. In case the name is already defined in the symbol table, the new binding takes precedence over the old.
- We need to look up a name in a symbol table to find the object the name is bound to. If the name is not defined in the symbol table, we need to be told that.
- We need to enter a new scope.
- We need to exit a scope, reestablishing the symbol table to what it was before the scope was entered.

Implementation of symbol tables

There are many ways to implement symbol tables, but the most important distinction between them is how scopes are handled. This may be done using two types of data structure:
1. a persistent (or functional) data structure, or
2. an imperative (or destructively updated) data structure.

Persistent (functional) data structures: in functional languages like SML, Scheme, or Haskell, persistent data structures are the norm rather than the exception (which is why persistent data structures are sometimes called functional).

For example, when a new element is added to a list or an element is taken off the head of the list, the old list still exists and can be used elsewhere. A list is a natural way to implement a symbol table in a functional language: a binding is a pair of a name and its associated object, and a symbol table is a list of such pairs. The operations are implemented in the following way:
- empty: an empty symbol table is an empty list.
- binding: a new binding (name/object pair) is added (cons'ed) to the front of the list.
- lookup: the list is searched until a matching name is found, and the object paired with the name is returned. If the end of the list is reached, an indication that this happened is returned instead. This indication can be made by raising an exception or by letting the lookup function return a type that can hold both objects and error indications, i.e., a sum type.
- enter: the old list is remembered, i.e., a reference is made to it.
- exit: the old list is recalled, i.e., the above reference is used.

The latter two operations are not really explicit operations: entering and exiting a scope is done by binding a symbol table to a name before entering a new scope and then referring to this name again after the scope is exited. As new bindings are added to the front of the list, they will automatically take precedence over old bindings as the list is searched from front to back.

Imperative symbol tables: imperative symbol tables are natural to use if the compiler is written in an imperative language. A simple imperative symbol table can be implemented as a stack, which works in a way similar to the list-based functional implementation:
- empty: an empty symbol table is an empty stack.
- binding: a new binding (name/object pair) is pushed on top of the stack.
- lookup: the stack is searched top-to-bottom until a matching name is found, and the object paired with the name is returned. If the bottom of the stack is reached, we instead return an error indication.
- enter: the top-of-stack pointer is remembered.
- exit: the old top-of-stack pointer is recalled and becomes the current one.

This is not quite a persistent data structure, as leaving a scope will destroy its symbol table. For most languages this won't matter, as a scope isn't needed again after it is exited. If this is not the case, a real persistent symbol table must be used, or the needed parts of the symbol table must be stored for later retrieval before exiting the scope.

Efficiency issues

While all of the above implementations are simple, they share the same efficiency problem: lookup is done by linear search, so the worst-case time for lookup is proportional to the size of the symbol table. This is mostly a problem in relation to libraries: it is quite common for a program to use libraries that define literally hundreds of names. A common solution to this problem is hashing: names are hashed (processed) into integers, which are used to index an array. Each array element is then a linear list of the bindings of names that share the same hash code. Given a large enough hash table, these lists will typically be very short, so lookup time is basically constant.
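A compact C sketch combining the imperative stack discipline and the hashing described above follows; the table size, names, and hash function are illustrative choices, not a prescribed design:

    #include <stdlib.h>
    #include <string.h>

    /* Hashed, scoped, imperative symbol table: each bucket is a
       list with the newest binding first, so new bindings shadow
       old ones; exiting a scope pops that scope's bindings. The
       name string is assumed to outlive its binding. */
    #define HSIZE 211

    struct Binding {
        const char *name;
        void *object;            /* whatever the name is bound to */
        int scope;               /* scope depth at which it was bound */
        struct Binding *next;
    };

    static struct Binding *table[HSIZE];
    static int depth;            /* current scope depth */

    static unsigned hash(const char *s)
    {
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % HSIZE;
    }

    void bind(const char *name, void *object)   /* new binding wins */
    {
        struct Binding *b = malloc(sizeof *b);
        unsigned h = hash(name);
        b->name = name; b->object = object; b->scope = depth;
        b->next = table[h];                     /* front of the bucket */
        table[h] = b;
    }

    void *lookup(const char *name)              /* NULL if undefined */
    {
        struct Binding *b;
        for (b = table[hash(name)]; b; b = b->next)
            if (strcmp(b->name, name) == 0)
                return b->object;
        return NULL;
    }

    void enter_scope(void) { depth++; }

    void exit_scope(void)   /* pop every binding made in this scope */
    {
        int i;
        for (i = 0; i < HSIZE; i++)
            while (table[i] && table[i]->scope == depth) {
                struct Binding *b = table[i];
                table[i] = b->next;
                free(b);
            }
        depth--;
    }

Because bindings within a bucket are kept newest-first, all bindings belonging to the current scope sit at the front of their buckets, which is what makes exit_scope a simple prefix removal.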


More information

CS 403: Scanning and Parsing

CS 403: Scanning and Parsing CS 403: Scanning and Parsing Stefan D. Bruda Fall 2017 THE COMPILATION PROCESS Character stream Scanner (lexical analysis) Token stream Parser (syntax analysis) Parse tree Semantic analysis Abstract syntax

More information

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES THE COMPILATION PROCESS Character stream CS 403: Scanning and Parsing Stefan D. Bruda Fall 207 Token stream Parse tree Abstract syntax tree Modified intermediate form Target language Modified target language

More information

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; } Ex: The difference between Compiler and Interpreter The interpreter actually carries out the computations specified in the source program. In other words, the output of a compiler is a program, whereas

More information

The Structure of a Syntax-Directed Compiler

The Structure of a Syntax-Directed Compiler Source Program (Character Stream) Scanner Tokens Parser Abstract Syntax Tree Type Checker (AST) Decorated AST Translator Intermediate Representation Symbol Tables Optimizer (IR) IR Code Generator Target

More information

The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.

The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. COMPILER DESIGN 1. What is a compiler? A compiler is a program that reads a program written in one language the source language and translates it into an equivalent program in another language-the target

More information

Lexical Analyzer Scanner

Lexical Analyzer Scanner Lexical Analyzer Scanner ASU Textbook Chapter 3.1, 3.3, 3.4, 3.6, 3.7, 3.5 Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 Main tasks Read the input characters and produce

More information

COMPILER DESIGN. For COMPUTER SCIENCE

COMPILER DESIGN. For COMPUTER SCIENCE COMPILER DESIGN For COMPUTER SCIENCE . COMPILER DESIGN SYLLABUS Lexical analysis, parsing, syntax-directed translation. Runtime environments. Intermediate code generation. ANALYSIS OF GATE PAPERS Exam

More information

Formal Languages and Compilers Lecture VI: Lexical Analysis

Formal Languages and Compilers Lecture VI: Lexical Analysis Formal Languages and Compilers Lecture VI: Lexical Analysis Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/ artale/ Formal

More information

Lexical Analyzer Scanner

Lexical Analyzer Scanner Lexical Analyzer Scanner ASU Textbook Chapter 3.1, 3.3, 3.4, 3.6, 3.7, 3.5 Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 Main tasks Read the input characters and produce

More information

CST-402(T): Language Processors

CST-402(T): Language Processors CST-402(T): Language Processors Course Outcomes: On successful completion of the course, students will be able to: 1. Exhibit role of various phases of compilation, with understanding of types of grammars

More information

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised: EDAN65: Compilers, Lecture 06 A LR parsing Görel Hedin Revised: 2017-09-11 This lecture Regular expressions Context-free grammar Attribute grammar Lexical analyzer (scanner) Syntactic analyzer (parser)

More information

CS606- compiler instruction Solved MCQS From Midterm Papers

CS606- compiler instruction Solved MCQS From Midterm Papers CS606- compiler instruction Solved MCQS From Midterm Papers March 06,2014 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 Final Term MCQ s and Quizzes CS606- compiler instruction If X is a

More information

CSc 453 Lexical Analysis (Scanning)

CSc 453 Lexical Analysis (Scanning) CSc 453 Lexical Analysis (Scanning) Saumya Debray The University of Arizona Tucson Overview source program lexical analyzer (scanner) tokens syntax analyzer (parser) symbol table manager Main task: to

More information

Compiler Design. Subject Code: 6CS63/06IS662. Part A UNIT 1. Chapter Introduction. 1.1 Language Processors

Compiler Design. Subject Code: 6CS63/06IS662. Part A UNIT 1. Chapter Introduction. 1.1 Language Processors Compiler Design Subject Code: 6CS63/06IS662 Part A UNIT 1 Chapter 1 1. Introduction 1.1 Language Processors A compiler is a program that can read a program in one language (source language) and translate

More information

CMSC 350: COMPILER DESIGN

CMSC 350: COMPILER DESIGN Lecture 11 CMSC 350: COMPILER DESIGN see HW3 LLVMLITE SPECIFICATION Eisenberg CMSC 350: Compilers 2 Discussion: Defining a Language Premise: programming languages are purely formal objects We (as language

More information

UNIT III & IV. Bottom up parsing

UNIT III & IV. Bottom up parsing UNIT III & IV Bottom up parsing 5.0 Introduction Given a grammar and a sentence belonging to that grammar, if we have to show that the given sentence belongs to the given grammar, there are two methods.

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2005 Handout 6 Decaf Language Wednesday, September 7 The project for the course is to write a

More information

Part 5 Program Analysis Principles and Techniques

Part 5 Program Analysis Principles and Techniques 1 Part 5 Program Analysis Principles and Techniques Front end 2 source code scanner tokens parser il errors Responsibilities: Recognize legal programs Report errors Produce il Preliminary storage map Shape

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview n Tokens and regular expressions n Syntax and context-free grammars n Grammar derivations n More about parse trees n Top-down and

More information

Yacc: A Syntactic Analysers Generator

Yacc: A Syntactic Analysers Generator Yacc: A Syntactic Analysers Generator Compiler-Construction Tools The compiler writer uses specialised tools (in addition to those normally used for software development) that produce components that can

More information

Monday, August 26, 13. Scanners

Monday, August 26, 13. Scanners Scanners Scanners Sometimes called lexers Recall: scanners break input stream up into a set of tokens Identifiers, reserved words, literals, etc. What do we need to know? How do we define tokens? How can

More information

LECTURE NOTES ON COMPILER DESIGN P a g e 2

LECTURE NOTES ON COMPILER DESIGN P a g e 2 LECTURE NOTES ON COMPILER DESIGN P a g e 1 (PCCS4305) COMPILER DESIGN KISHORE KUMAR SAHU SR. LECTURER, DEPARTMENT OF INFORMATION TECHNOLOGY ROLAND INSTITUTE OF TECHNOLOGY, BERHAMPUR LECTURE NOTES ON COMPILER

More information

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 5

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 5 CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 5 CS 536 Spring 2015 1 Multi Character Lookahead We may allow finite automata to look beyond the next input character.

More information

Wednesday, September 3, 14. Scanners

Wednesday, September 3, 14. Scanners Scanners Scanners Sometimes called lexers Recall: scanners break input stream up into a set of tokens Identifiers, reserved words, literals, etc. What do we need to know? How do we define tokens? How can

More information

A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer.

A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. Compiler Design A compiler is computer software that transforms computer code written in one programming language (the source language) into another programming language (the target language). The name

More information

Compilers and Interpreters

Compilers and Interpreters Overview Roadmap Language Translators: Interpreters & Compilers Context of a compiler Phases of a compiler Compiler Construction tools Terminology How related to other CS Goals of a good compiler 1 Compilers

More information

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications Agenda for Today Regular Expressions CSE 413, Autumn 2005 Programming Languages Basic concepts of formal grammars Regular expressions Lexical specification of programming languages Using finite automata

More information

Yacc: Yet Another Compiler-Compiler

Yacc: Yet Another Compiler-Compiler Stephen C. Johnson ABSTRACT Computer program input generally has some structure in fact, every computer program that does input can be thought of as defining an input language which it accepts. An input

More information

Zhizheng Zhang. Southeast University

Zhizheng Zhang. Southeast University Zhizheng Zhang Southeast University 2016/10/5 Lexical Analysis 1 1. The Role of Lexical Analyzer 2016/10/5 Lexical Analysis 2 2016/10/5 Lexical Analysis 3 Example. position = initial + rate * 60 2016/10/5

More information

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR Pune Vidyarthi Griha s COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR By Prof. Anand N. Gharu (Assistant Professor) PVGCOE Computer Dept.. 22nd Jan 2018 CONTENTS :- 1. Role of lexical analysis 2.

More information

Compiler phases. Non-tokens

Compiler phases. Non-tokens Compiler phases Compiler Construction Scanning Lexical Analysis source code scanner tokens regular expressions lexical analysis Lennart Andersson parser context free grammar Revision 2011 01 21 parse tree

More information

Compiler Construction D7011E

Compiler Construction D7011E Compiler Construction D7011E Lecture 2: Lexical analysis Viktor Leijon Slides largely by Johan Nordlander with material generously provided by Mark P. Jones. 1 Basics of Lexical Analysis: 2 Some definitions:

More information

COMPILER CONSTRUCTION Seminar 02 TDDB44

COMPILER CONSTRUCTION Seminar 02 TDDB44 COMPILER CONSTRUCTION Seminar 02 TDDB44 Martin Sjölund (martin.sjolund@liu.se) Adrian Horga (adrian.horga@liu.se) Department of Computer and Information Science Linköping University LABS Lab 3 LR parsing

More information

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/}

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/} Class 5 Lex Spec Example delim [ \t\n] ws {delim}+ letter [A-Aa-z] digit [0-9] id {letter}({letter} {digit})* number {digit}+(\.{digit}+)?(e[+-]?{digit}+)? %% {ws} {/*no action and no return*?} if {return(if);}

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview Tokens and regular expressions Syntax and context-free grammars Grammar derivations More about parse trees Top-down and bottom-up

More information

Gechstudentszone.wordpress.com

Gechstudentszone.wordpress.com UNIT - 8 LEX AND YACC 2 8.1 USING YACC Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked

More information

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD

EXPERIMENT NO : M/C Lenovo Think center M700 Ci3,6100,6th Gen. H81, 4GB RAM,500GB HDD GROUP - B EXPERIMENT NO : 07 1. Title: Write a program using Lex specifications to implement lexical analysis phase of compiler to total nos of words, chars and line etc of given file. 2. Objectives :

More information

Ulex: A Lexical Analyzer Generator for Unicon

Ulex: A Lexical Analyzer Generator for Unicon Ulex: A Lexical Analyzer Generator for Unicon Katrina Ray, Ray Pereda, and Clinton Jeffery Unicon Technical Report UTR 02a May 21, 2003 Abstract Ulex is a software tool for building language processors.

More information

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs Chapter 2 :: Programming Language Syntax Programming Language Pragmatics Michael L. Scott Introduction programming languages need to be precise natural languages less so both form (syntax) and meaning

More information

Chapter 3: Lexing and Parsing

Chapter 3: Lexing and Parsing Chapter 3: Lexing and Parsing Aarne Ranta Slides for the book Implementing Programming Languages. An Introduction to Compilers and Interpreters, College Publications, 2012. Lexing and Parsing* Deeper understanding

More information

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis. Topics Chapter 4 Lexical and Syntax Analysis Introduction Lexical Analysis Syntax Analysis Recursive -Descent Parsing Bottom-Up parsing 2 Language Implementation Compilation There are three possible approaches

More information

A lexical analyzer generator for Standard ML. Version 1.6.0, October 1994

A lexical analyzer generator for Standard ML. Version 1.6.0, October 1994 A lexical analyzer generator for Standard ML. Version 1.6.0, October 1994 Andrew W. Appel 1 James S. Mattson David R. Tarditi 2 1 Department of Computer Science, Princeton University 2 School of Computer

More information

The Language for Specifying Lexical Analyzer

The Language for Specifying Lexical Analyzer The Language for Specifying Lexical Analyzer We shall now study how to build a lexical analyzer from a specification of tokens in the form of a list of regular expressions The discussion centers around

More information

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No. # 01 Lecture No. # 01 An Overview of a Compiler This is a lecture about

More information

Lexical Analysis. Introduction

Lexical Analysis. Introduction Lexical Analysis Introduction Copyright 2015, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies

More information

Lexical Analysis. Chapter 2

Lexical Analysis. Chapter 2 Lexical Analysis Chapter 2 1 Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples

More information

Lexical Analysis. Lecture 3. January 10, 2018

Lexical Analysis. Lecture 3. January 10, 2018 Lexical Analysis Lecture 3 January 10, 2018 Announcements PA1c due tonight at 11:50pm! Don t forget about PA1, the Cool implementation! Use Monday s lecture, the video guides and Cool examples if you re

More information

Implementation of Lexical Analysis

Implementation of Lexical Analysis Implementation of Lexical Analysis Outline Specifying lexical structure using regular expressions Finite automata Deterministic Finite Automata (DFAs) Non-deterministic Finite Automata (NFAs) Implementation

More information

Ray Pereda Unicon Technical Report UTR-02. February 25, Abstract

Ray Pereda Unicon Technical Report UTR-02. February 25, Abstract iflex: A Lexical Analyzer Generator for Icon Ray Pereda Unicon Technical Report UTR-02 February 25, 2000 Abstract iflex is software tool for building language processors. It is based on flex, a well-known

More information

PESIT Bangalore South Campus Hosur road, 1km before Electronic City, Bengaluru -100 Department of Computer Science and Engineering

PESIT Bangalore South Campus Hosur road, 1km before Electronic City, Bengaluru -100 Department of Computer Science and Engineering TEST 1 Date : 24 02 2015 Marks : 50 Subject & Code : Compiler Design ( 10CS63) Class : VI CSE A & B Name of faculty : Mrs. Shanthala P.T/ Mrs. Swati Gambhire Time : 8:30 10:00 AM SOLUTION MANUAL 1. a.

More information

When do We Run a Compiler?

When do We Run a Compiler? When do We Run a Compiler? Prior to execution This is standard. We compile a program once, then use it repeatedly. At the start of each execution We can incorporate values known at the start of the run

More information

Compilers. Prerequisites

Compilers. Prerequisites Compilers Prerequisites Data structures & algorithms Linked lists, dictionaries, trees, hash tables Formal languages & automata Regular expressions, finite automata, context-free grammars Machine organization

More information

Program Analysis ( 软件源代码分析技术 ) ZHENG LI ( 李征 )

Program Analysis ( 软件源代码分析技术 ) ZHENG LI ( 李征 ) Program Analysis ( 软件源代码分析技术 ) ZHENG LI ( 李征 ) lizheng@mail.buct.edu.cn Lexical and Syntax Analysis Topic Covered Today Compilation Lexical Analysis Semantic Analysis Compilation Translating from high-level

More information

Lexical Considerations

Lexical Considerations Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Spring 2010 Handout Decaf Language Tuesday, Feb 2 The project for the course is to write a compiler

More information

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres dgriol@inf.uc3m.es Introduction: Definitions Lexical analysis or scanning: To read from left-to-right a source

More information

Appendix Set Notation and Concepts

Appendix Set Notation and Concepts Appendix Set Notation and Concepts In mathematics you don t understand things. You just get used to them. John von Neumann (1903 1957) This appendix is primarily a brief run-through of basic concepts from

More information

Automatic Scanning and Parsing using LEX and YACC

Automatic Scanning and Parsing using LEX and YACC Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions CSE 413 Programming Languages & Implementation Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions 1 Agenda Overview of language recognizers Basic concepts of formal grammars Scanner Theory

More information

CSE P 501 Exam 11/17/05 Sample Solution

CSE P 501 Exam 11/17/05 Sample Solution 1. (8 points) Write a regular expression or set of regular expressions that generate the following sets of strings. You can use abbreviations (i.e., name = regular expression) if it helps to make your

More information

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective Chapter 4 Lexical analysis Lexical scanning Regular expressions DFAs and FSAs Lex Concepts CMSC 331, Some material 1998 by Addison Wesley Longman, Inc. 1 CMSC 331, Some material 1998 by Addison Wesley

More information