University of Edinburgh Division of Informatics


Developing an Alpha/Tru64 Port of Diablo (Diablo Is A Better Link-time Optimiser)

4th Year Project Report, Computer Science

Peter Morrow: p.d.morrow@sms.ed.ac.uk

February 27, 2007

Abstract: Traditional optimising compilers currently do a fairly good job of producing efficient assembly code. However, they are limited with regards to the scope in which they can perform program optimisations, i.e. only the source modules which make up a program, and a number of other restrictions apply when attempting to perform optimisations at compile time. In this dissertation I describe the implementation of a whole-program optimiser for the Tru64 UNIX / Alpha platform which operates after the compilation and linking stages of program development. Performing optimisations after a program has been compiled and linked removes the restrictions in place at compile time; as a result I observed an average improvement of 4.28% in execution time and 13.23% in binary size.


0.1 Acknowledgements

My supervisor Professor Michael O'Boyle for his many excellent suggestions and guidance throughout the project. Research assistant Timothy Jones for his excellent assistance with a wide range of technical issues, ideas, and support. Bruno De Bus of the PARIS group at the University of Ghent for providing great support whenever I had a problem with Diablo. The Universitat Politecnica De Catalunya for allowing me access to their Alpha-based systems running the Tru64 UNIX operating system.


Contents

0.1 Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Compiler Optimisation
  1.3 Project Structure
  1.4 Dissertation Structure

2 Background
  2.1 Why Link Time Optimisation?
  2.2 Diablo
  2.3 Statically and Dynamically Linked Binaries
  2.4 The Alpha Architecture
      Alpha PALcode

3 Related Work
  3.1 Classical Optimisations
      Optimisation Techniques and Trade-offs
  3.2 Existing Tools
      OM
      ALTO
      Squeeze, Squeeze++ & Diablo
      Diota
  Link Time Optimisation Techniques
      Specific Code Optimisation Work
  Summary

4 The Alpha Port
  The Diablo Framework
      The Link Layer
      Diablo Classes
      The Flowgraph Layer
      Front end and Back end Interaction
  Alpha Architecture Representation in Diablo
      Alpha Instruction Representation in Diablo
  Disassembly
      Instruction Formats
      Disassembly Functions
  Control Flow Graph Creation
      Basic Block Leader Detection [2]
      4.3.2 Addition of Edges to the CFG [2]
  Deflowgraphing (Code Layout)
      Creating Chains of Basic Blocks
      Fix Point Calculations
  Assembler
  Summary

5 Optimisations
  Useless Instructions Removal
      Removal of Dead Code
      Removal of GP Re-Computations
      Removal of No-ops
  Branch Optimisations
      Elimination of Indirect Jumps
      Removal of Useless Conditional Branches
  Summary

6 Results
  Test Environment
  Instruction Elimination
      Diablo Base: Dead Code Removal Only
      Removal of Nops
      Removal of GP Re-computations
      Combined Instruction Elimination and Dead Code Elimination
      Conclusion
  Branch Optimisations
      Branch Conversions
      Combined Branch and Instruction Elimination
      Improved Combinations
  Summary

7 Conclusion
  Further Work
      Optimisations

Bibliography

A Code Listing

B Running Diablo
  B.1 Compiling Diablo
  B.2 Preparing A Program For Optimisation
  B.3 Using Diablo

1. Introduction

1.1 Motivation

Nowadays, even as computer hardware becomes more freely available at lower cost, there is still a need to produce programs which use less disk space, run faster, and run with a smaller memory footprint. There is no point in programs not taking advantage of the advances in hardware! Embedded devices are also becoming more and more popular; these devices often have limited storage space and limited memory on board, hence we need to produce more efficient binaries without sacrificing correctness. Optimising programs after linking complements traditional optimising compilers: we can push optimisations to the limit by combining compile-time and post-link-time optimisations, so it makes sense to go that extra mile and optimise at link time too.

1.2 Compiler Optimisation

Compiler optimisation is the process of analysing programming language constructs and attempting to transform these constructs into more efficient forms. Efficiency in this sense relates to the execution time of a program, its size, and the amount of power needed to execute it on a given CPU. Most modern compilers include an optimisation phase which attempts to increase the efficiency of a program; a user may specify the optimisation level at which the compiler optimises code. The optimisation phase of a compiler takes place after source files have been converted into an intermediate representation. This intermediate form captures all of the syntactic and semantic information gained from the source files and is eventually used to build graphs on which analysis can be performed and hence transformations can be applied. Figure 1.1 shows this process. A compiler's underlying purpose is to generate correct assembly code and nothing more.

Other external tools are used to create a final binary which a user can run: an assembler is needed to transform the assembly code instructions into machine code instructions which are stored in object files (.o files). A linker is also needed to combine the generated object files and to fix any undefined references to code or data not within the current object file. For example, there might be a call to the printf function in the generated assembly code. The machine code instruction has no idea where the printf routine is located, so there needs to be some method to locate the routine - when the program is linked

the linker will search its library paths for the printf routine and will replace all references to printf with a physical address with which instructions can reference this function. Usually this process is transparent to the user; for example the GNU C compiler will perform all of these operations on the user's behalf unless they specify otherwise.

Figure 1.1: Compilation Process

In this dissertation I describe optimising a program after it has been compiled, assembled and linked. It turns out that after link time, or during link time (note: from now on these terms will be used interchangeably; they mean the same thing in the context of this dissertation), we can perform even more optimisations, even if the compiler was able to perform aggressive program optimisations. This is generally due to the information provided to us by the linker. This information allows us to, for example, calculate indirect jump targets, which can be used to improve program execution time by converting costly indirect jumps into PC-relative jumps. An optimisation such as this cannot be performed by the compiler, since the compiler might not know the target of the jump - it could be in a far-off library such as the standard C library. It is the job of the linker, and not the compiler, to determine this kind of information. Possibly the biggest advantage of performing optimisations after a program has been linked is the global view of the program. After linking we have access to all object files; these object files contain the external functions which programmers use in everyday programs. Functions such as malloc, printf, qsort etc. can all be optimised for specific calling contexts. Compilers simply cannot optimise these external library functions because they do not see their high-level source code - they see nothing but the source which a user has supplied!
The majority of this dissertation is devoted to the development of a port of an existing post-link-time optimiser to another architecture, the Alpha. The remainder of the dissertation focuses on the optimisations which are available to us at link time; I discuss the optimisations which I have implemented in the port as well as further optimisation-related work which will eventually make it into the port. A link-time optimisation tool such as the one which I have ported works by reading a binary along with all object files which make up a program. With these

files and a linker map file we can emulate the operation of a linker; essentially we re-link the object files and binary using the linker map file. Doing so gives us a global view of the program with which we can perform additional program optimisations. The operation of an optimiser which works post link is described in figure 1.2. This diagram essentially shows the portions of the port which need to be implemented. Thankfully the portion of the port which reads the inputs to Diablo was already implemented, therefore it was my task to implement the disassembly, assembly, control flow graph creation, code layout, and basic optimisation phases of the Alpha port of Diablo. A control flow graph is a representation of a binary program in which nodes are blocks of code connected by edges which describe possible control transfers. This data structure is an underlying concept in an optimisation tool: we perform the analysis and optimisation phases on this generated control flow graph.

Figure 1.2: Post link optimisation

1.3 Project Structure

The main goal of the project was to port a backend of an existing post-link-time optimiser to another architecture. On completing this phase of the project I continued working on the code base and extended it to perform some program optimisations which have resulted in good reductions in both binary size and execution time. In the dissertation I have written a considerable amount on the research area that is compiler optimisation, in an attempt to familiarise the reader with many of the concepts in the field and in turn to ease them through the rest of the dissertation. I describe my implementation of the port, which was completed in around 3500 lines of code in the C programming language. Following this I describe the optimisations which I implemented and how these optimisations affected program execution time and program size.
Finally I give a summary of the project and point to further work which might be undertaken on this project.

1.4 Dissertation Structure

I begin this dissertation with a chapter devoted to background information on the project. In this chapter I explain in more depth what exactly link-time optimisation is, why it is useful, and the applications of performing transformations after linking. Following this I give an introduction to the tool which I will be porting, and I then differentiate between statically and dynamically linked binaries. Finally I give an overview of the Alpha architecture, pointing out any interesting features and other notable information about the architecture which will be assumed in later chapters.

Chapter 3 describes related work within the field of compiler optimisation; it also describes optimisation techniques and papers which are specific to post-link-time optimisation. I also devote some time to describing other tools which are used for object-level modification. Chapter 4 is devoted to the implementation of the Alpha port. The chapter begins by introducing the reader to the Diablo framework; this is followed by implementation details of the actual port itself. Chapter 5 describes the various optimisations which I decided to implement having completed work on the backend of the port. Here I mention a few optimisations which relate to code size and one which relates to improving execution time. Chapter 6 reports on the results observed from applying the optimisations on a set of benchmarks. The benefits and pitfalls of each optimisation are looked at in depth. Chapter 7 is the final chapter - I give my conclusions on the success of the project and mention any further work that could be merged into the Alpha port. Appendix A gives a very brief description of the source files which I wrote, primarily to point someone who is interested in taking on this work in the right direction. Appendix B gives guidelines for using the tool and preparing programs for optimisation after linking.

2. Background

In this chapter I discuss information which I think plays an important role in the project. I begin by explaining in more depth just exactly what post-link-time optimisation is and why we should perform it. Following this I introduce the Diablo tool and talk a little about binary files. I finish the chapter by describing the underlying architecture to which Diablo will be ported.

2.1 Why Link Time Optimisation?

Typically code optimisation falls within the realm of the compiler, since the compiler has access to much more semantic and syntactic information such as data types and data structures. Inter-procedural optimisations are possible at compile time, however they are restricted to the procedures within a single source code file. For example, external libraries such as the standard C library which are linked to a program cannot be optimised within this specific program's context. Generally aggressive procedural and inter-procedural optimisations will yield good results with regards to speed and code size, however there are many opportunities for even better results if we think about performing transformations after link time. At link time we have access to all of the object files which make up a program, so we are able to perform classical optimisations at the basic block, instruction, and whole-program (external libraries and all) levels. Hence we can expect to see gains in many areas of program efficiency such as speed, binary size and power consumption. Link-time optimisation sounds like a perfect solution to getting the most out of a binary, however there are some drawbacks to using this approach to perform code optimisations:

- Optimisation at the object level is non-trivial. It can be difficult to extract any real meaning from an object file (which in essence is really just a series of machine code instructions), whereas with a source file it is easier to analyse the semantics of a piece of code.
Having semantic information about a program makes it easier to perform intelligent optimisations.

- At source level strange code constructs are rare (programmers usually do not write strange code), so analysis at the source level is generally easier. At machine code level, the compiler may generate strange code, and hence strange code sequences are much more frequent. This makes our task that little bit harder.

- Generally the generated binaries are larger than the source files; couple this

with the difficulty of object optimisation and the problem becomes harder again.

Of course there are advantages to link-time optimisation as opposed to compile-time optimisation:

- We can be language independent! Object file formats obviously do not differ from language to language, so we can effectively apply our optimisations to any source language.

- We can perform optimisations which are not available at compile time, such as jump conversions. Here the compiler often cannot determine the target of a jump, hence the most conservative jump instruction must be chosen. On the Alpha this will usually translate to either a JMP or JSR (jump subroutine) instruction, which are costly because they need to read from memory before they perform their jump. If we know the target (which the compiler won't) we can replace these instructions with other more efficient instructions.

- We can perform whole-program optimisations. That is, we can optimise functions and instructions that we did not write, i.e. those in external libraries. Optimising at link time is the only way to optimise an entire program.

In this dissertation I only consider the application of optimisation after link time, however many other applications of post-link binary rewriting exist and may be useful in certain circumstances:

Translation to a different architecture - Since we have all the information that a program has to offer, complete binary translation to another architecture can be performed. DEC (now Compaq) successfully created a binary translator to run x86 binaries on DEC's Alpha systems [5]; Intel and the Transmeta corporation have also created similar translation systems.

Customisation - Say a processor developer updated their instruction set to include some new instructions. Usually you would have to wait for the compiler to add support for these instructions, which could take a long time.
By using a link-time translation program, we could add support for the new instructions quickly. Another example might be replacing instructions that have been removed from an instruction set with instructions that are valid.

Instrumentation - A common reason for instrumenting a binary is to obtain profile information, for example finding the hottest paths through a program for common cases. Information such as this can be used to create a profile which we can use to optimise specific parts of the

program. The FIT (Flexible Instrumentation Toolkit) [3] tool developed by the PARIS group at the University of Ghent was created to do just this. This tool is a member of the Diablo family, however sadly a port to Alpha does not exist at this time. Another application of instrumentation is checking the safety of a program. For example, the usual but slightly dated method for exploiting a buffer overflow is to change the return address of a function to point to a piece of code such as a setuid shell. It would be possible post link to check that all return addresses lie in a valid area of memory.

Code Obfuscation - Code obfuscation is a technique which makes binary programs harder to understand and hence harder to reverse engineer. Performing obfuscation at link time yields much better results than at compile time since, as before, external functions can also be obfuscated. Calls to functions in a static context can be disguised as calls to other functions. Loco [17] is a code obfuscation tool developed by the PARIS group at the University of Ghent, Belgium.

2.2 Diablo

Diablo (Diablo Is A Better Link-Time Optimiser) is a retargetable binary rewriting framework developed by the PARIS research group at the University of Ghent in Belgium. Currently ports exist for MIPS, x86, ARM, and IA64, and currently only statically linked programs are supported for optimisation. Diablo was designed to fix many of the problems present in other link-time binary rewriters, such as lack of support for multiple architectures, correctness, safety, and extensibility. Diablo supports multiple architectures by providing a generic interface with which one can describe a new architecture; many architecture-independent utility functions exist and help to make the job of a person writing a port a little easier. Diablo is guaranteed to be safe because of the information that is available to it at link time.
Information about relocations is available and hence we can guarantee that the output binary will be correct. Adding new features to Diablo is also an easy task; generally adding a new feature involves no more than actually writing the feature - there is no need to trawl through the complex internal workings of Diablo. Diablo is conceptually split into two halves, the front end and the back end. The front end is an architecture-independent environment in which generic functions call architecture-dependent functions within an architecture back end in order to output a final translated binary. Diablo takes as its inputs an original binary which has been statically linked with the necessary libraries, the libraries themselves (.o and .a object files), and finally a linker map file. The linker map file provides

information to Diablo which is necessary to emulate a linked program. The map file contains references to the library functions which the original binary calls; with this information Diablo is able to create an emulated statically linked binary. This dissertation describes the implementation of a back end of Diablo for the Alpha architecture using Digital UNIX ECOFF binaries and the Tru64 UNIX operating system. I also implement some basic optimisations in an attempt to improve program execution behaviour.

2.3 Statically and Dynamically Linked Binaries

When a program is linked statically, all references within the program are resolved within the final program binary. This means that all a user needs to run the program is the single binary file; the binary contains everything it needs to know about the program. On the other hand, dynamically linked programs contain references to external libraries. These references need to be resolved at run time on every execution of the program, hence performance usually takes a hit when compared to statically linked programs. With statically linked programs we have available to us all the information relating to the program and its libraries, hence we have a good opportunity to perform global optimisations. Statically linked binaries however are much larger, since we are cramming all of the information from other libraries into one final binary. Although this sounds bad, it is cancelled out by the fact that we need not have a copy of the external libraries to run the program. A program could easily be optimised on a machine where disk space requirements are not an issue; the optimised binary could then be placed on an embedded device with little hard disk space, where it would run with no problems. This is all well and good if there are only a few programs which need to reside on the resource-constrained device. If the device needs access to a large range of programs then statically linked binaries are not the way to go.
In [15] the main problems associated with linking programs dynamically are described and solutions to these problems are posed.

2.4 The Alpha Architecture

The Alpha [7] [12] is a true 64-bit superscalar RISC architecture which has been designed with performance as its main focus. There are 64 registers available to use, all of which are 64 bits wide. Registers $0 through $31 are integer registers; registers $32 through $63 are floating point registers. Registers $31 and $63 are

hard-coded to the values 0 and 0.0 respectively. Some of these registers by convention have standard uses:

- Register $29 ($gp) is used to index the global address table (GAT), which is a table of 64-bit constants. Each function maintains its own GAT, hence re-computation of the $gp (global pointer) register is performed on entry to and exit from every function. This property of functions is something that will be looked at in detail later in the dissertation.

- Registers $0, $32, $33 are generally used to store the return values of functions.

- Registers $16... $21 and $48... $53 are used to store function parameters. Functions with more than 6 parameters should use the stack to pass their parameters. This would be useful information when attempting to implement function inlining optimisations.

- Register $26 ($ra) stores the return address of a function, i.e. this register is saved every time a function is entered so the program can continue execution from the correct location once the function has returned.

- Register $30 ($sp) is the stack pointer.

- Register $27 ($pv) is a pointer to the currently executing procedure.

Endianness on Alpha systems can be either little or big, however for this project I am assuming that switched endianness has not been enabled, so all operations use the little-endian byte addressing mode. Enabling support for switched endianness is trivial within Diablo and is a possible candidate for further work. The instruction set is a fixed-size instruction set with all instructions being encoded as 32-bit unsigned integers; instructions are grouped into 5 high-level formats:

- PALcode format.
- Branch format.
- Operate format.
- Floating point operate format.
- Memory format.

Slight variations on these formats exist for instructions such as unconditional branches and operate instructions with an immediate operand. Instruction formats will be mentioned more in the design and implementation of the disassembler and assembler sections of this paper.
An instruction format describes which bits of the instruction word are used. For example, all instruction formats use the 6 most significant bits to store the instruction opcode.

Alpha PALcode

Both software developers and operating systems developers require consistent functions across different implementations of the Alpha architecture. However, there may be subtle differences in hardware implementation which a developer might like to take advantage of - for example context switches, memory management functions, power management, and emulation of instructions with no hardware support. PALcode is used to implement such functions using standard Alpha instructions which reside in main memory. Other architectures may support such functions in hardware, whereas the designers of the Alpha architecture chose to allow modification of such functions via the PALcode interface. PALcode routines can be thought of as software interrupts (SWI) since they require complete control of the machine, with interrupts disabled. During the port of Diablo we need not worry too much about PALcode libraries; we should just assume that they alter control flow like a regular branch instruction. This is important when it comes to creating the control flow graph.

3. Related Work

Research in the field of compiler optimisation has been ongoing for a very long time, hence there are many related areas of work. In this chapter I will attempt to give a quick round-up of related work in the field, and more specifically work related to optimisations and tools which are used post link time.

3.1 Classical Optimisations

Classical optimisations are the optimisations which everyone has heard of; they form the basis of today's optimising compilers. It is important to mention them in the context of optimising a program post link time since many of the compile-time classical optimisations can be re-applied after linking. The linking process is very beneficial to us: it reveals a lot of interesting information which in turn gives us the opportunity to perform classical optimisations again. Typically a compiler will perform a couple of rounds of classical optimisations; after link time we can do yet more rounds of classical optimisations to increase program performance.

Optimisation Techniques and Trade-offs

In this section we discuss various traditional compile-time optimisations along with their link-time variants and why these optimisations are viable at or after linking. It is also necessary to introduce these terms in a simple form at this stage in the dissertation since they will be mentioned heavily throughout the remaining sections.

Register Allocation

Register allocation is an optimisation which attempts to increase the number of variables which are stored in registers, since loading and storing values from and to memory just before they are needed is much more costly than direct register-to-register operations. Performing register allocation in turn gives us more opportunities to perform other optimisations such as peephole optimisations and strength reduction (the process of replacing costly instructions with cheaper ones).
There are two main approaches to register allocation: the older method is based on graph colouring [24], the newer method is named

linear scan [23]. Graph colouring will produce better register allocation schemes, however it takes a much longer time to determine a scheme. This is not a problem for the binaries which we plan to optimise with the port of the optimiser which I have written; however, in the case of dynamic compilation strategies such as JIT compilation, the speed of register allocation is important [27]. Linear scan offers much better speed when producing register allocations, however the quality of the allocation may suffer, with more variables left out of registers for longer. What I have described thus far suffices for register allocation at compile time. It turns out we can perform register allocation again after the linking process! The idea is the same except we can perform register allocation on a global scale. This means liveness information can be propagated between module boundaries, hence a more accurate register allocation scheme can be generated. The term liveness was mentioned briefly but was not explained; I will attempt to explain this idea as it is central to register allocation strategies. A variable or register is said to be live at a specific program execution point if it holds a value which might be used at a later point in its execution path [2]. At its most basic, register allocation is the task of assigning live variables to CPU registers where possible. Liveness analysis needs a control flow graph to determine live variables within a program; during my implementation we create a control flow graph with which an improved register allocation scheme could be implemented.

Peephole Optimisations

This class of optimisation is used on small blocks of nearby instructions. In his paper [19] McKeeman first describes peephole optimisations and gives a number of examples. They are so called because examining nearby instructions in a block of instructions is said to be like peeping through a hole and peering around.
Some examples: the statements

    X := Y;
    Z := X + Z

will generate assembly code something similar to:

    LDA y, (Y)
    STA y, (X)
    LDA x, (X)   ; Useless, eliminate.
    ADD Z, x, Z

Here the 3rd instruction is not required, since the value of X is already available in register y; the add instruction would then be changed to ADD Z, y, Z. Performing these optimisations after linking as well as during

compilation should yield performance gains as well as code size reductions. However, removing instructions introduces a problem with regards to jump targets and basic block positioning. These problems can be resolved during the code layout phase, when writing the final binary after optimisations have been performed.

Profile Guided Optimisations

Profile-guided optimisation (PGO) can be used to tune the performance of a given program by detecting bottlenecks in a program's execution. A compiler which supports profile-guided optimisations, such as Intel's C/C++ compiler, needs to run twice over the source program. The first run of the compiler instruments the program to reveal information such as instruction execution counts or basic block execution counts. The binary must then be run with the instrumentation code added; the information obtained is then fed back into the compiler along with the original program. The compiler can then make more informed decisions about where it optimises the program and which optimisations it applies. Typically programs which contain many branches, for example checking the return value of a function call against a wide range of values, will benefit from PGO. Programs which are generally fixed in their operation will perform better with PGO when compared to programs which have a much larger possible range of outcomes. Post-link-time PGO extends compile-time PGO: we can instrument the actual library functions and generate profile information using instruction and basic block reference counts. Imagine a library function which performs a lot of I/O; this function might open files, create sockets etc. A library function like this might greatly benefit from PGO since usually when a programmer writes an I/O function they then check for correct return values - they might even check the value which errno was set to. If the program which calls this function generally does not deviate too much, i.e.
the I/O functions which are called usually perform the same operations and return the same values, then PGO will work excellently.

Unreachable Code Elimination

Unreachable code is code to which there is no path of execution. We will see in later chapters of this dissertation (control flow graph creation) how control transfers are represented in a control flow graph. In the example below, line 8 is clearly unreachable. Constructs like this are detected after the control flow graph has been created; this is one of the easier optimisations to perform, and we will find that the optimiser which I am porting already performs this task for us.

1 int func()

2 {
3     int x = 10, y = 5, z = 20;
4
5     x *= 2;
6     y = x + z;
7     return (y * y);
8     y *= z;
9 }

Line 8 is clearly unreachable, therefore it makes sense to remove this instruction.

Branch Optimisations

Branches are usually the biggest cost inducers in a program, since they need to access and set a couple of registers, or even memory in some cases. In some cases we can replace branch instructions with non-branch instructions such as conditional moves, and here we should see a significant performance gain. Branches which need to access memory are called indirect jumps and are the slowest kind of jump on the Alpha: they need to access the GAT, which is a costly operation. It turns out that we can modify instructions of this type if we can determine the target of the jump. The modification involves changing the instruction to a less costly PC-relative jump such as a BR or a BSR. We will see later in this dissertation that an optimisation like this performs very well. This is because indirect jumps are sometimes the instructions used to call a procedure, and since there will be many procedure calls within a program's lifetime we can greatly reduce execution time by making these modifications.

3.2 Existing Tools

There has been a lot of work in the field of post link time optimisation in the last decade or so. The topic of link time optimisation was first mentioned in a paper [28] written by Wall at DEC's lab in Palo Alto, California. In his paper Wall describes an intermediate language named Mahler for their experimental 32-bit machine (Titan). Mahler was designed as a compromise between traditional assembly code and a compiler's intermediate representation, i.e. the Mahler language was used to represent all source languages. The Mahler system comprised a Mahler assembler which would take a Mahler source file (either hand written or compiler generated) and produce object files containing disassembled Mahler instructions.
The assembler in this case, unlike modern day assemblers, generated heavily annotated code. It would record a list of the variables defined

and procedures called within a module; relocation information was also stored. Storing relocation information at this stage meant they were able to employ a smart register allocation scheme at link time. This scheme was used to determine whether certain load and store instructions could be promoted to registers rather than kept in memory, giving a certain performance boost. The most interesting portion of the Titan/Mahler system was the optimising linker, which was able to perform interesting low level and high level optimisations such as register allocation and instruction scheduling. The information retained by the Mahler assembler was very useful: source level constructs were annotated, and therefore the linker could perform surprisingly sophisticated optimisation at what is a very late stage in the process of generating a final binary. Most subsequent work in the area of link time and post link time optimisation was based on the ideas presented in this paper.

OM

Tools such as OM [25], another link time optimising system for DEC's Alpha architecture, sprang up after Wall's paper. OM was developed by the same researchers at DEC's lab in Palo Alto, and is much more practical than the mostly theoretical approach of the Mahler system described in [28]. OM's approach is close to the approach which a modern day link time optimiser such as Diablo takes. OM reads all the libraries and binaries which constitute a program, and the code within the program is converted into a Register Transfer Language (RTL). The RTL is used to represent the code that must be generated; it is essentially a bridge between a high level language and assembly language. RTLs are usually specified in LISP style syntax and are usually implementation independent, that is, one could look at an RTL description and not know the target architecture's instruction set.
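As a toy illustration (the operator and register names below are invented, not OM's actual RTL vocabulary), a register transfer such as r1 := r2 + 8 might be written as a nested LISP-style expression which a back end then lowers to target instructions:

```python
# A LISP-style RTL expression for "r1 := r2 + 8", written as nested
# Python tuples. The operators here are made up for illustration.
rtl = ("set", "r1", ("plus", "r2", ("const", 8)))

def lower(expr):
    """Lower one 'set' RTL expression to a list of pseudo-instructions.
    Handles only the tiny register-plus-constant shape used above."""
    _, dest, (op, lhs, rhs) = expr
    assert op == "plus" and rhs[0] == "const"
    # An Alpha-flavoured add of a register and a literal.
    return [f"ADDQ {lhs}, #{rhs[1]}, {dest}"]

print(lower(rtl))  # → ['ADDQ r2, #8, r1']
```

The same expression could be lowered differently for each target, which is exactly the point of an implementation-independent RTL.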
The RTL is then converted into a control flow graph: nodes in the graph are basic blocks, and edges are possible paths of execution. Once a control flow graph has been created OM performs global code optimisations; it can perform these global optimisations since the entire program is represented by the control flow graph. In [21] they describe solutions to interprocedural liveness analysis using an interprocedural control flow graph, and OM used this technique along with several others to perform liveness analysis of registers during the linking phase of building a program. The developers of OM wanted a lightweight optimiser which performed post link time optimisations, and to meet this goal they sacrificed the number of optimisations performed on a program. Many other optimisations which could be performed after linking were skipped; OM does however implement the following [20]: register allocation, instruction scheduling, dead code elimination, and peephole optimisations. OM also supports profile guided optimisation after linking. Overall OM on average saw around 5% gains in program

efficiency, sometimes seeing gains as good as 12% [25].

ALTO

After the eventual stagnation of the OM tool, another tool named Alto [20] appeared as a result of Robert Muth's PhD dissertation at the University of Arizona. Alto strove to be a much more complete link time optimisation tool, and succeeded in bettering OM's results quite considerably. Alto included the following optimisations within its framework: code inlining, instruction scheduling, interprocedural liveness analysis, peephole optimisations, instruction and block elimination, instruction normalisation, constant propagation, and code compression. With this wide range of optimisations Alto saw excellent results in terms of program execution time and code size: it achieves code size reductions of up to 38% and execution time gains of up to 6% on average [20]. However Alto is not very portable; it only works on statically linked Tru64 UNIX ECOFF binaries. As far as I am aware a port to GNU/Linux and ELF binaries was planned but never released to the public. Alto also suffered from a large memory overhead when it ran, since all data in Alto was allocated statically; there is no dynamic memory allocation using standard C functions such as malloc(3) and free(3).

Squeeze, Squeeze++ & Diablo

Squeeze and Squeeze++ [11] [9] followed from Alto and focused mainly on code compression, an issue which was becoming more and more important at this time due to the proliferation of embedded devices on the market. Finally, Diablo seeks to take the best parts of all the tools already mentioned whilst incorporating new ideas. Many of the concepts in the other tools are very outdated; Diablo brings new ideas together as well as reusing the techniques which were successful in the older link time optimisation tools. Diablo shows code size reductions of approximately 5% to 15% on the ARM platform.
The benchmarks used to produce these results were compiled with an incredibly aggressive, industry grade optimising developer suite released by ARM (the ARM developer suite); this needs to be taken into consideration when assessing the results Diablo has shown on the ARM platform.

Diota

Diota [18] (Dynamic Instrumentation, Optimisation and Transformations of Applications) is a tool which works with dynamically linked programs, in contrast to all of the previous tools, which work only with statically linked programs. Instrumentation has been briefly covered in the PGO section of this dissertation, therefore I will only briefly describe the instrumentation which Diota performs. Diota can insert instructions at specific places in a program in order to reveal information which is otherwise unavailable until the program is actually executed. Using this technique hot code can be discovered and hence optimised more aggressively, or more specifically with regard to the code's context. Instrumentation in Diota can be used for more than just this: for example, temporarily fixing errors without recompilation, or adding support for newer instructions offered by the target native instruction set. Gathering information about a program's execution is referred to as passive instrumentation, whereas actually modifying the final binary program is known as active instrumentation. Diota is able to monitor all memory accesses and is hence able to detect race conditions; race conditions in programs such as Mozilla and Konqueror have been detected whilst running under the control of Diota.

3.3 Link Time Optimisation Techniques

The tools mentioned in the previous section show good results for code compaction, and similarly good results for execution time. How exactly does an optimiser reduce the size of a binary, in some cases by close to 50%, while also reducing execution time? In this section of the dissertation I introduce some of the related work in the field of post link optimisation which has led to such impressive results.

Specific Code Optimisation Work

Most C/C++ development tool chains only perform optimisations at compile time.
As we have previously mentioned, the compiler is limited to individual source modules when performing optimisations. Modules can be seen as boundaries between different parts of a program, and in turn can also be seen as optimisation boundaries [9]. Typically linkers perform little or no optimisation; the Intel C/C++ tool chain is, however, an exception to this rule. Usually a linker will only resolve dependencies in programs by searching libraries for external functions, ignoring any opportunities for global optimisation. Another less conspicuous problem is the advocacy of code reuse in the software engineering community. Students are also told to write code to handle every

situation, and programs may have functions which were implemented in case a new situation arose. Often these cases and functions are never used, hence binaries become bloated with references to useless objects [9]. The compiler cannot remove these cases because it does not know the contexts in which the program is called; calling contexts are visible to us after linking, since we have a global view of the program. We know which objects are useless if we perform control flow analysis, hence we can eliminate them. The other ports of Diablo currently employ most modern code size reduction methods. Post link time value analysis [9] statically determines whether a register's value, independent of the program's input, is either constant or can only take a value from some restricted set. Value analysis is used to remove portions of code which we can determine will never be executed, or to remove portions of code whose result is already known. A good example of value analysis is constant propagation [4]. Compilers are able to perform only limited constant propagation, since they do not know anything about the data and code addresses which are determined at link time. Constant propagation algorithms work in the same way at compile time and post link time; at link time we just have a much wider view of the program and hence much more access to information regarding constants. We can also apply post link time register re-allocation schemes. De Sutter et al [10] make improvements to backwards liveness analysis and introduce forward liveness analysis. Combined forward and backwards liveness analysis has produced excellent results on the Alpha architecture, finding 62% more than previous attempts did; this translates to the amazing conclusion that on average only half of the Alpha registers are in use. Since compilers compile source modules separately, they must adhere to calling conventions.
Calling conventions are defined in an application binary interface (ABI) and specify, for example, which registers must be saved and when, which registers are used to pass parameters to functions, and which registers are used for return values. The method described in [10] works by bending these calling conventions. This is safe to do since at link time we can see the whole program at once; code does not jump unpredictably from module to module, so it is safe to relax calling conventions as long as correctness is not sacrificed. Given calling convention information, the advanced interprocedural liveness analysis can then be performed. In [10] they improve on the state of the art liveness analysis described by Robert Muth in [20]; Muth's technique was used in Alto, Squeeze and Squeeze++ [11] [9] with good results.
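The core of any such register liveness analysis is a backwards dataflow computation over the control flow graph. The sketch below is a minimal single-procedure rendering of that idea (the block names, uses and definitions are invented, and the interprocedural and calling-convention refinements from [10] are omitted):

```python
# Backwards liveness analysis over a toy CFG, iterated to a fixed point.
# live_out[b] = union of live_in over b's successors;
# live_in[b]  = use[b] | (live_out[b] - def[b]).
def liveness(blocks, succs):
    """blocks: {name: (use_set, def_set)}; succs: {name: [successors]}.
    Returns (live_in, live_out) dictionaries per basic block."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            new_in = use | (out - defs)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out

# r1 is defined in "entry" and used in "exit", so it stays live
# across the loop even though the loop never touches it.
blocks = {
    "entry": (set(),        {"r1"}),   # r1 := ...
    "loop":  ({"r2"},       {"r2"}),   # r2 := f(r2)
    "exit":  ({"r1", "r2"}, set()),    # return g(r1, r2)
}
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succs)
print(sorted(live_out["entry"]))  # → ['r1', 'r2']
```

Any register absent from every live set at a program point is free there, which is exactly the information a link time register re-allocator exploits.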

Smart Code Layout

Code layout refers to the process of converting the nodes (basic blocks) of a control flow graph into a straight line program which can be executed on the target machine [2]. The general strategy is to place together blocks through which execution passes straight, that is, blocks which do not end in jumps. Jumps are costly instructions which degrade I-cache (instruction cache) performance [12]. I-cache performance is greatest when instructions which execute close together in time are also laid out close together in memory. This is the principle of cache locality, which holds since instructions located close together are likely to refer to the same registers and memory locations. Therefore we can improve a program's execution time by laying code out in a way which is complementary to cache locality. In [22] they describe an algorithm which uses profiling data to lay out basic blocks efficiently. Within most programs there are paths of execution which are frequently taken and paths which are rarely taken. Traditional compilers will often place such blocks together, and in doing so diminish the effect of the caching system. What Pettis et al describe is a method of eliminating these cases by moving the infrequently executed paths out of the way of the frequently executed paths, i.e. increasing cache performance. Their method also extends the expected time before a branch is taken, which again improves I-cache hit rates. Their method uses a weighted digraph: each node in the graph is a basic block, and each edge represents a control flow path between two basic blocks, i.e. a jump from one to another or a fall through (straight line) edge. Each edge is weighted using the data gained during the profiling run; the weight between two nodes represents the number of times that path was taken.
Their algorithm starts by finding the edge of largest weight in the graph; the nodes connected by this edge are chained together. In the graph, the nodes chosen for chaining are merged, and the edges which led from them are coalesced. This process continues until the graph contains one or more nodes with no edges. This method results in excellent I-cache usage improvement: benchmarks tested on the Hewlett-Packard PA-RISC architecture showed a 98.2% improvement in I-cache performance. The method described above and in [22] performs very well; however, the algorithm is rather old and there have been further improvements in the area of code layout. One new technique, which is forming a new area of research in compiler optimisation, is the use of artificial intelligence. In [16] a novel approach to finding optimal code layouts using neural networks has been developed. The approach in the paper is architecture dependent, and profiling information is also required for a proper implementation of the algorithm. The profiling information is used to determine features of the neural network, information such as instruction type and the direction of the branch (either forward: positive displace-
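The greedy chaining step just described can be sketched as follows. This is a simplified, intraprocedural rendering of the Pettis and Hansen algorithm; the block names and edge weights are invented for illustration:

```python
# Greedy basic-block chaining driven by profile edge weights, in the
# spirit of Pettis and Hansen's code layout algorithm.
def chain_blocks(edges):
    """edges: {(src, dst): weight}. Greedily merge blocks into chains,
    taking edges from heaviest to lightest; an edge joins two chains only
    when src ends one chain and dst begins a different one."""
    chains = {b: [b] for e in edges for b in e}   # every block starts alone
    for (src, dst), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        c_src, c_dst = chains[src], chains[dst]
        if c_src is not c_dst and c_src[-1] == src and c_dst[0] == dst:
            merged = c_src + c_dst
            for b in merged:          # every member now points at the merge
                chains[b] = merged
    seen, layout = [], []
    for c in chains.values():         # emit each distinct chain once
        if not any(c is s for s in seen):
            seen.append(c)
            layout.append(c)
    return layout

# A hot loop path A->B->D (weight 90) dominates a cold error path
# A->C->D (weight 10); the hot path ends up laid out contiguously.
edges = {("A", "B"): 90, ("B", "D"): 90, ("A", "C"): 10, ("C", "D"): 10}
print(chain_blocks(edges))  # → [['A', 'B', 'D'], ['C']]
```

The frequently executed blocks A, B and D fall through to one another in the final layout, while the rarely executed block C is pushed out of the hot path, which is precisely the cache-locality effect described above.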


More information

Slide Set 1 (corrected)

Slide Set 1 (corrected) Slide Set 1 (corrected) for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary January 2018 ENCM 369 Winter 2018

More information

Page 1. Structure of von Nuemann machine. Instruction Set - the type of Instructions

Page 1. Structure of von Nuemann machine. Instruction Set - the type of Instructions Structure of von Nuemann machine Arithmetic and Logic Unit Input Output Equipment Main Memory Program Control Unit 1 1 Instruction Set - the type of Instructions Arithmetic + Logical (ADD, SUB, MULT, DIV,

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

Lecture 4: MIPS Instruction Set

Lecture 4: MIPS Instruction Set Lecture 4: MIPS Instruction Set No class on Tuesday Today s topic: MIPS instructions Code examples 1 Instruction Set Understanding the language of the hardware is key to understanding the hardware/software

More information

Program Optimization

Program Optimization Program Optimization Professor Jennifer Rexford http://www.cs.princeton.edu/~jrex 1 Goals of Today s Class Improving program performance o When and what to optimize o Better algorithms & data structures

More information

Chapter 3:: Names, Scopes, and Bindings (cont.)

Chapter 3:: Names, Scopes, and Bindings (cont.) Chapter 3:: Names, Scopes, and Bindings (cont.) Programming Language Pragmatics Michael L. Scott Review What is a regular expression? What is a context-free grammar? What is BNF? What is a derivation?

More information

COS 140: Foundations of Computer Science

COS 140: Foundations of Computer Science COS 140: Foundations of Computer Science CPU Organization and Assembly Language Fall 2018 CPU 3 Components of the CPU..................................................... 4 Registers................................................................

More information

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis Synthesis of output program (back-end) Intermediate Code Generation Optimization Before and after generating machine

More information

Reversing. Time to get with the program

Reversing. Time to get with the program Reversing Time to get with the program This guide is a brief introduction to C, Assembly Language, and Python that will be helpful for solving Reversing challenges. Writing a C Program C is one of the

More information

Lecture Outline. Intermediate code Intermediate Code & Local Optimizations. Local optimizations. Lecture 14. Next time: global optimizations

Lecture Outline. Intermediate code Intermediate Code & Local Optimizations. Local optimizations. Lecture 14. Next time: global optimizations Lecture Outline Intermediate code Intermediate Code & Local Optimizations Lecture 14 Local optimizations Next time: global optimizations Prof. Aiken CS 143 Lecture 14 1 Prof. Aiken CS 143 Lecture 14 2

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA)... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data

More information

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming

More information

(Refer Slide Time: 01:25)

(Refer Slide Time: 01:25) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture - 32 Memory Hierarchy: Virtual Memory (contd.) We have discussed virtual

More information

Overview. Introduction to the MIPS ISA. MIPS ISA Overview. Overview (2)

Overview. Introduction to the MIPS ISA. MIPS ISA Overview. Overview (2) Introduction to the MIPS ISA Overview Remember that the machine only understands very basic instructions (machine instructions) It is the compiler s job to translate your high-level (e.g. C program) into

More information

Comp 11 Lectures. Mike Shah. June 26, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures June 26, / 57

Comp 11 Lectures. Mike Shah. June 26, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures June 26, / 57 Comp 11 Lectures Mike Shah Tufts University June 26, 2017 Mike Shah (Tufts University) Comp 11 Lectures June 26, 2017 1 / 57 Please do not distribute or host these slides without prior permission. Mike

More information

T Jarkko Turkulainen, F-Secure Corporation

T Jarkko Turkulainen, F-Secure Corporation T-110.6220 2010 Emulators and disassemblers Jarkko Turkulainen, F-Secure Corporation Agenda Disassemblers What is disassembly? What makes up an instruction? How disassemblers work Use of disassembly In

More information

Personalised Learning Checklist ( ) SOUND

Personalised Learning Checklist ( ) SOUND Personalised Learning Checklist (2015-2016) Subject: Computing Level: A2 Name: Outlined below are the topics you have studied for this course. Inside each topic area you will find a breakdown of the topic

More information

Programmer Directed GC for C++ Michael Spertus N2286= April 16, 2007

Programmer Directed GC for C++ Michael Spertus N2286= April 16, 2007 Programmer Directed GC for C++ Michael Spertus N2286=07-0146 April 16, 2007 Garbage Collection Automatically deallocates memory of objects that are no longer in use. For many popular languages, garbage

More information

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace Project there are a couple of 3 person teams regroup or see me or forever hold your peace a new drop with new type checking is coming using it is optional 1 Compiler Architecture source code Now we jump

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Why Study Assembly Language?

Why Study Assembly Language? Why Study Assembly Language? This depends on the decade in which you studied assembly language. 1940 s You cannot study assembly language. It does not exist yet. 1950 s You study assembly language because,

More information

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore (Refer Slide Time: 00:27) Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Lecture - 1 An Overview of a Compiler Welcome

More information

Programming at different levels

Programming at different levels CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING 2014 EMY MNEMONIC MACHINE LANGUAGE PROGRAMMING EXAMPLES Programming at different levels CS1114 Mathematical Problem : a = b + c CS2214 CS2214 The C-like

More information

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Author: Dillon Tellier Advisor: Dr. Christopher Lupo Date: June 2014 1 INTRODUCTION Simulations have long been a part of the engineering

More information

CE221 Programming in C++ Part 1 Introduction

CE221 Programming in C++ Part 1 Introduction CE221 Programming in C++ Part 1 Introduction 06/10/2017 CE221 Part 1 1 Module Schedule There are two lectures (Monday 13.00-13.50 and Tuesday 11.00-11.50) each week in the autumn term, and a 2-hour lab

More information

Chapter 3:: Names, Scopes, and Bindings (cont.)

Chapter 3:: Names, Scopes, and Bindings (cont.) Chapter 3:: Names, Scopes, and Bindings (cont.) Programming Language Pragmatics Michael L. Scott Review What is a regular expression? What is a context-free grammar? What is BNF? What is a derivation?

More information

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI CSCI 402: Computer Architectures Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI op Instruction address 6 bits 26 bits Jump Addressing J-type

More information

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358 Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

Lecture Notes on Liveness Analysis

Lecture Notes on Liveness Analysis Lecture Notes on Liveness Analysis 15-411: Compiler Design Frank Pfenning André Platzer Lecture 4 1 Introduction We will see different kinds of program analyses in the course, most of them for the purpose

More information

IRIX is moving in the n32 direction, and n32 is now the default, but the toolchain still supports o32. When we started supporting native mode o32 was

IRIX is moving in the n32 direction, and n32 is now the default, but the toolchain still supports o32. When we started supporting native mode o32 was Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2002 Handout 23 Running Under IRIX Thursday, October 3 IRIX sucks. This handout describes what

More information

Lecture 4: Instruction Set Design/Pipelining

Lecture 4: Instruction Set Design/Pipelining Lecture 4: Instruction Set Design/Pipelining Instruction set design (Sections 2.9-2.12) control instructions instruction encoding Basic pipelining implementation (Section A.1) 1 Control Transfer Instructions

More information

3. Simple Types, Variables, and Constants

3. Simple Types, Variables, and Constants 3. Simple Types, Variables, and Constants This section of the lectures will look at simple containers in which you can storing single values in the programming language C++. You might find it interesting

More information

Other Forms of Intermediate Code. Local Optimizations. Lecture 34

Other Forms of Intermediate Code. Local Optimizations. Lecture 34 Other Forms of Intermediate Code. Local Optimizations Lecture 34 (Adapted from notes by R. Bodik and G. Necula) 4/18/08 Prof. Hilfinger CS 164 Lecture 34 1 Administrative HW #5 is now on-line. Due next

More information

Administrative. Other Forms of Intermediate Code. Local Optimizations. Lecture 34. Code Generation Summary. Why Intermediate Languages?

Administrative. Other Forms of Intermediate Code. Local Optimizations. Lecture 34. Code Generation Summary. Why Intermediate Languages? Administrative Other Forms of Intermediate Code. Local Optimizations HW #5 is now on-line. Due next Friday. If your test grade is not glookupable, please tell us. Please submit test regrading pleas to

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Understand the factors involved in instruction set

Understand the factors involved in instruction set A Closer Look at Instruction Set Architectures Objectives Understand the factors involved in instruction set architecture design. Look at different instruction formats, operand types, and memory access

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

The X86 Assembly Language Instruction Nop Means

The X86 Assembly Language Instruction Nop Means The X86 Assembly Language Instruction Nop Means As little as 1 CPU cycle is "wasted" to execute a NOP instruction (the exact and other "assembly tricks", as explained also in this thread on Programmers.

More information

Interprocedural Variable Liveness Analysis for Function Signature Recovery

Interprocedural Variable Liveness Analysis for Function Signature Recovery Interprocedural Variable Liveness Analysis for Function Signature Recovery MIGUEL ARAUJO AND AHMED BOUGACHA {maraujo@cs, ahmed.bougacha@sv}.cmu.edu Carnegie Mellon University April 30, 2014 Final Project

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands Stored Program Concept Instructions: Instructions are bits Programs are stored in memory to be read or written just like data Processor Memory memory for data, programs, compilers, editors, etc. Fetch

More information

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Lecture #2 January 30, 2004 The 6502 Architecture

Lecture #2 January 30, 2004 The 6502 Architecture Lecture #2 January 30, 2004 The 6502 Architecture In order to understand the more modern computer architectures, it is helpful to examine an older but quite successful processor architecture, the MOS-6502.

More information

Problem with Scanning an Infix Expression

Problem with Scanning an Infix Expression Operator Notation Consider the infix expression (X Y) + (W U), with parentheses added to make the evaluation order perfectly obvious. This is an arithmetic expression written in standard form, called infix

More information

Memory Management. Memory Management

Memory Management. Memory Management Memory Management Most demanding di aspect of an operating system Cost has dropped. Consequently size of main memory has expanded enormously. Can we say that we have enough still. Swapping in/out. Memory

More information

The Effects on Read Performance from the Addition of a Long Term Read Buffer to YAFFS2. Sam Neubardt

The Effects on Read Performance from the Addition of a Long Term Read Buffer to YAFFS2. Sam Neubardt The Effects on Read Performance from the Addition of a Long Term Read Buffer to YAFFS2 Sam Neubardt My research project examined the effects on read performance from the addition of a long term read buffer

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Field Analysis. Last time Exploit encapsulation to improve memory system performance

Field Analysis. Last time Exploit encapsulation to improve memory system performance Field Analysis Last time Exploit encapsulation to improve memory system performance This time Exploit encapsulation to simplify analysis Two uses of field analysis Escape analysis Object inlining April

More information