CS 4201 Compilers 2014/2015
Handout: Lab 1
Eng. Maha Talaat DBS Lab (1)

Lab Content:
- What is a compiler?
- What is compilation?
- Features of a compiler
- Compiler structure
- Phases of a compiler
- Programs related to compilers
- Some data structures

What is a compiler?

Compiler: A program that translates an executable program in one language into an executable program in another language. It is also called a recognizer or translator.

Interpreter: A program that reads an executable program and produces the results of executing that program.

    Source program (high-level language) -> Compiler -> Target program (assembly or machine language)
    The compiler also produces error messages.

Source = Target: the source and target programs must be equivalent.

Major tasks:
- Analysis of the source program
- Synthesis of the target language instructions

What is compilation?

Translation of a program written in a source language into a semantically equivalent program written in a target language.

Features of a compiler
- Correctness: preserves the meaning of the code
- Speed of the target code
- Speed of compilation (translation)
- Good error reporting/handling
- Cooperation with the debugger
- Support for separate compilation

Compiler structure

    Source code -> Front End -> IR -> Back End

Phases of a compiler

Phases:
o Scanning,
o Parsing,
o Semantic analysis,
o Intermediate code generation,
o Intermediate code optimization,
o Target code generation,
o Target code optimization.

Auxiliary components that interact with the phases:
o Literal table,
o Symbol table,
o Error handler.

Flow chart of a typical compiler:

    Source code
      -> Scanner                         -> Tokens
      -> Parser                          -> Syntax/Parse tree
      -> Semantic analyzer               -> Annotated tree
      -> Intermediate code generation    -> Intermediate code
      -> Intermediate code optimization  -> Intermediate code (optimized)
      -> Target code generation
      -> Target code optimization

    The literal table, symbol table, and error handler interact with the phases.
Some Notations

    Source code (sequence of characters)
      -> Lexical analyzer             -> Sequence of tokens
      -> Syntax analyzer              -> Abstract syntax tree
      -> Semantic analyzer            -> Annotated abstract syntax tree
      -> Intermediate code generator  -> Intermediate code
      -> Intermediate code optimizer  -> Intermediate code
      -> Code generator

Scanner:

Actions:
o Reads characters from the source program,
o Groups characters into lexemes (sequences of characters that go together) following a given pattern,
o Each lexeme corresponds to a token (the scanner returns the next token to the parser),
o The scanner may also discover lexical errors (erroneous characters).

The definition of what a lexeme, token, or bad character is depends on the definition of the source language.

Tokens: Represent basic program entities such as:
o Identifiers,
o Literals,
o Reserved words,
o Operators,
o Delimiters, etc.

Ex for C:
C sentence: L1: x = y2 + 12;

    Lexeme:  L1  :      x   =       y2  +        12   ;
    Token:   ID  Colon  ID  Assign  ID  Plus Op  INT  Semicolon

An arbitrary number of blanks may appear between lexemes.

Erroneous sequences of characters for the C language:
o Control characters,
o @
o 2abc

Parser:
o Groups tokens into grammatical phrases, to discover the underlying structure of the source,
o Finds syntax errors.

    Lexeme:  index  =       *      12   ;
    Token:   ID     Assign  Times  INT  Semicolon
Every token is legal, but the sequence is erroneous.
o May find some static semantic errors, such as the use of undeclared variables or multiply declared variables,
o May generate code, or build some intermediate representation of the source program, such as an abstract syntax tree.

Ex for C:
Source code: position = initial + rate * 60;

Abstract syntax tree:

    =
    |-- position
    `-- +
        |-- initial
        `-- *
            |-- rate
            `-- 60

Semantic analyzer:
o Checks for more static semantic errors,
o May annotate and/or change the abstract syntax tree with type information.

Semantics consist of:
o Runtime semantics: the behavior of the program at runtime.
o Static semantics: checked by the compiler. These include:
    - Declaration of variables and constants before use,
    - Calling functions that exist,
    - Passing parameters properly,
    - Type checking.

Annotated syntax tree:

    =
    |-- position (Float)
    `-- +
        |-- initial (Float)
        `-- *
            |-- rate (Float)
            `-- Int-to-Float
                `-- 60

Intermediate code generator:

Actions:
o Translates from the abstract syntax tree to intermediate code,
o The intermediate representation should have 2 important properties:
    - It should be easy to produce,
    - It should be easy to translate into the target program.
o The intermediate representation can take a variety of forms:
    - Three-address code,
    - P-code for an abstract machine,
    - Tree or DAG representation.

3-address code:
Each statement contains:
o At most 3 operands,
o In addition to := (assignment), at most one operator.
It is an easy and universal format that can be translated into most assembly languages.

    Temp1 := int_to_float(60)
    Temp2 := rate * Temp1
    Temp3 := initial + Temp2
    position := Temp3

Optimizer:
o Improves the efficiency of the intermediate code,
o The goal may be to make the code run faster and/or to make the code smaller.

    Temp2 := rate * 60.0
    position := initial + Temp2

Code generation:
o The compiler generates:
    - Pure machine code or assembly code,
    - Virtual machine code.
o Allocates memory locations for variables,
o Allocates registers for intermediate computations.

    LOADF  rate, R1
    MULF   #60.0, R1
    LOADF  initial, R2
    ADDF   R2, R1
    STOREF R1, position

Code optimization:
Applied to:
o Intermediate code:
    - Elimination of common sub-expressions,
    - Identification and elimination of unreachable code,
    - Improving function calls,
    - Improving loops.
o Target code:
    - Allocation and use of registers,
    - Selection of better (faster) instructions and addressing modes.
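
The front-end phases described above can be tried out on the handout's running example. The following is only a minimal sketch in Python, not a reference implementation: the names `scan`, `parse`, and `to_three_address` are hypothetical, the grammar covers just `ID = expr ;` with `+` and `*`, and it assumes every identifier is a Float, so integer literals are wrapped in `int_to_float` as in the annotated tree.

```python
import re
from itertools import count

# Token patterns for the tiny language of the example (an assumption of this sketch).
TOKEN_SPEC = [
    ("INT", r"\d+\b"),           # \b rejects lexemes such as 2abc
    ("ID", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),            # arbitrary blanks between lexemes
    ("ERROR", r"."),             # any other character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source):
    """Scanner: group characters into lexemes and return (token, lexeme) pairs."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"lexical error: unexpected character {lexeme!r}")
        tokens.append((kind, lexeme))
    return tokens


def parse(tokens):
    """Parser: build a syntax tree for `ID = expr ;`, with * binding tighter than +."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            found = tokens[pos][1] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"syntax error: expected {kind}, found {found!r}")
        pos += 1
        return tokens[pos - 1][1]

    def factor():
        if peek() == "INT":
            return ("int", int(expect("INT")))
        return ("id", expect("ID"))

    def term():
        node = factor()
        while peek() == "TIMES":
            expect("TIMES")
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "PLUS":
            expect("PLUS")
            node = ("+", node, term())
        return node

    target = expect("ID")
    expect("ASSIGN")
    tree = ("=", ("id", target), expr())
    expect("SEMICOLON")
    return tree


def to_three_address(tree):
    """Intermediate code generator: flatten the tree into three-address statements."""
    code, temps = [], count(1)

    def emit(rhs):
        name = f"Temp{next(temps)}"
        code.append(f"{name} := {rhs}")
        return name

    def walk(node):
        if node[0] == "id":
            return node[1]
        if node[0] == "int":
            # Assumption: all identifiers are Float, so integer literals are converted.
            return emit(f"int_to_float({node[1]})")
        left, right = walk(node[1]), walk(node[2])
        return emit(f"{left} {node[0]} {right}")

    _, (_, target), value = tree
    code.append(f"{target} := {walk(value)}")
    return code


for statement in to_three_address(parse(scan("position = initial + rate * 60;"))):
    print(statement)
```

For the example statement this prints the same four three-address statements as the handout. Feeding it `index = * 12 ;` scans cleanly but fails in `parse`, which mirrors the point that every token can be legal while the sequence is erroneous.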
Programs related to compilers

Pre-processor: Produces input to a compiler. Performs the following:
o Macro processing,
o File inclusion.

Assembler:
o Translator for the assembly language,
o Two-pass assembly:
    - All variables are allocated storage locations,
    - Assembler code is translated into machine code,
o Output is relocatable machine code.

Linkers:
o Link object files that were separately compiled or assembled,
o Link object files to standard library functions,
o Generate a file that can be loaded and executed.

Debuggers:
o Used to determine execution errors in a compiled program,
o Keep track of most or all of the source code information.

Editors: May include some operations of a compiler, reporting some errors.

    C or C++ program
      -> Pre-processor -> C or C++ program with macro substitution and file inclusions
      -> Compiler      -> Assembly code
      -> Assembler     -> Relocatable object module
      -> Linker (together with other object modules or library modules)
                       -> Executable code
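
The two pre-processor tasks listed above, macro processing and file inclusion, can be illustrated with a toy sketch. The function `preprocess` and the `files` dictionary standing in for the file system are hypothetical, and only object-like `#define NAME value` macros and `#include "name"` lines are handled; a real C pre-processor does far more.

```python
import re


def preprocess(source, files):
    """Toy pre-processor: expands #include "name" and object-like #define macros.

    `files` maps include names to their text (a stand-in for the file system).
    """
    macros, output = {}, []

    def process(text):
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith("#include"):
                # File inclusion: splice the named file's text in place.
                name = stripped.split('"')[1]
                process(files[name])
            elif stripped.startswith("#define"):
                # Macro definition: remember NAME -> replacement text.
                _, name, value = stripped.split(None, 2)
                macros[name] = value
            else:
                # Macro processing: replace whole-word occurrences only.
                for name, value in macros.items():
                    pattern = rf"\b{re.escape(name)}\b"
                    line = re.sub(pattern, lambda _m, v=value: v, line)
                output.append(line)

    process(source)
    return "\n".join(output)


files = {"limits.h": "#define MAX 100"}
program = '#include "limits.h"\nint table[MAX];\nint MAXIMUM;'
print(preprocess(program, files))
```

The output is plain program text with the macro substituted and the directive lines gone, which is what the compiler proper receives as its input.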
Major Data Structures in a Compiler

Principal Data Structures for Communication among Phases

TOKENS
- A scanner collects characters into a token, represented as a value of an enumerated data type for tokens.
- It may also preserve the string of characters or other derived information, such as the name of an identifier or the value of a number token.
- Tokens may be kept in a single global variable or in an array of tokens.

THE SYNTAX TREE
- A standard pointer-based structure generated by the parser.
- Each node represents information collected by the parser or by later phases; this information may be dynamically allocated or stored in the symbol table.
- A node requires different attributes depending on the kind of language structure it represents, so it may be represented as a variant record.

THE SYMBOL TABLE
- Keeps information associated with identifiers: functions, variables, constants, and data types.
- Interacts with almost every phase of the compiler.
- Access operations need to be constant-time.
- One or several hash tables are often used.

THE LITERAL TABLE
- Stores constants and strings, reducing the size of the program.
- Quick insertion and lookup are essential.

INTERMEDIATE CODE
- Kept as an array of text strings, a temporary text file, or a linked list of structures, depending on the kind of intermediate code (e.g. three-address code and P-code).
- Should be easy to reorganize.

TEMPORARY FILES
- Hold the products of intermediate steps during compilation.
- Solve the problem of memory constraints, or allow addresses to be back-patched during code generation.

Syntax versus Semantic Errors

A syntax error occurs when you write code that violates the grammar rules of the programming language. Syntax errors are detected by the compiler.

A semantic error means writing a valid programming structure with invalid logic. Semantic errors can be broken down into:
- Static semantic errors, which are detected by the compiler.
- Runtime errors and logic errors: runtime errors cause the program to crash or abort in some way, while logic errors cause a program to run to completion but produce incorrect output or an incorrect result.
Those errors are not detected by the compiler.
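
The error categories above can be observed with a short Python sketch. The helper `classify` is hypothetical; it uses Python's built-in `compile` as the translation step and `exec` as the execution step. Python is an assumption of convenience here: unlike a C compiler, it performs almost no static semantic checking, so a use of an undeclared name only shows up at run time.

```python
def classify(source):
    """Report at which stage (if any) a small program fails."""
    try:
        # Translation step: grammar violations are reported here, before anything runs.
        code = compile(source, "<example>", "exec")
    except SyntaxError:
        return "syntax error"
    try:
        # Execution step: runtime errors abort the program.
        exec(code, {})
    except Exception as exc:
        return f"runtime error ({type(exc).__name__})"
    # Logic errors are invisible to both steps: the program simply finishes.
    return "ran to completion"


print(classify("index = * 12"))            # grammar violation
print(classify("total = undeclared + 1"))  # would be a static semantic error in C
print(classify("ratio = 1 / 0"))           # runtime error
print(classify("average = (2 + 4) / 3"))   # logic error: should divide by 2
```

The last program runs to completion even though it computes the wrong average, which is why logic errors have to be found by testing rather than by the compiler.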