University of Edinburgh Division of Informatics


Developing an Alpha/Tru64 Port of Diablo (Diablo Is A Better Link-time Optimiser)

4th Year Project Report, Computer Science

Peter Morrow: p.d.morrow@sms.ed.ac.uk

February 27, 2007

Abstract: Traditional optimising compilers currently do a fairly good job of producing efficient assembly code. However, they are limited with regards to the scope in which they can perform program optimisations, i.e. only the source modules which make up a program, and a number of other restrictions apply when attempting to perform optimisations at compile time. In this dissertation I describe the implementation of a whole-program optimiser for the Tru64 UNIX / Alpha platform which operates after the compilation and linking stages of program development. Performing optimisations after a program has been compiled and linked removes the restrictions in place at compile time; as a result I observed an average improvement of 4.28% in execution time and 13.23% in binary size.


0.1 Acknowledgements

My supervisor Professor Michael O'Boyle for his many excellent suggestions and guidance throughout the project. Research assistant Timothy Jones for his excellent assistance with a wide range of technical issues, ideas, and support. Bruno De Bus of the PARIS group at the University of Ghent for providing great support whenever I had a problem with Diablo. The Universitat Politecnica De Catalunya for allowing me access to their Alpha-based systems running the Tru64 UNIX operating system.


Contents

0.1 Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Compiler Optimisation
  1.3 Project Structure
  1.4 Dissertation Structure

2 Background
  2.1 Why Link Time Optimisation?
  2.2 Diablo
  2.3 Statically and Dynamically Linked Binaries
  2.4 The Alpha Architecture
      Alpha PALcode

3 Related Work
  3.1 Classical Optimisations
      Optimisation Techniques and Trade-offs
  3.2 Existing Tools
      OM
      ALTO
      Squeeze, Squeeze++ & Diablo
      Diota
  Link Time Optimisation Techniques
      Specific Code Optimisation Work
  Summary

4 The Alpha Port
  The Diablo Framework
      The Link Layer
      Diablo Classes
      The Flowgraph Layer
      Front end and Back end Interaction
  Alpha Architecture Representation in Diablo
      Alpha Instruction Representation in Diablo
  Disassembly
      Instruction Formats
      Disassembly Functions
  Control Flow Graph Creation
      Basic Block Leader Detection [2]
      4.3.2 Addition of Edges to the CFG [2]
  Deflowgraphing (Code Layout)
      Creating Chains of Basic Blocks
      Fix Point Calculations
  Assembler
  Summary

5 Optimisations
  Useless Instructions Removal
      Removal of Dead Code
      Removal of GP Re-Computations
      Removal of No-ops
  Branch Optimisations
      Elimination of Indirect Jumps
      Removal of Useless Conditional Branches
  Summary

6 Results
  Test Environment
  Instruction Elimination
      Diablo Base: Dead Code Removal Only
      Removal of Nops
      Removal of GP Re-computations
      Combined Instruction Elimination and Dead Code Elimination
      Conclusion
  Branch Optimisations
      Branch Conversions
      Combined Branch and Instruction Elimination
      Improved Combinations
  Summary

7 Conclusion
  Further Work
      Optimisations

Bibliography

A Code Listing

B Running Diablo
  B.1 Compiling Diablo
  B.2 Preparing A Program For Optimisation
  B.3 Using Diablo

1. Introduction

1.1 Motivation

Nowadays, even as computer hardware becomes more freely available at lower cost, there is still a need to produce programs which use less disk space, run faster, and run with a smaller memory footprint. There is no point in programs not taking advantage of the advances in hardware! Embedded devices are also becoming more and more popular; these devices often have limited storage space and limited memory on board, hence we need to produce more efficient binaries without sacrificing correctness. Optimising programs after linking complements traditional optimising compilers: we can push optimisations to the limit by combining compile-time and post-link-time optimisations, so it makes sense to go that extra mile and optimise at link time too.

1.2 Compiler Optimisation

Compiler optimisation is the process of analysing programming language constructs and attempting to transform these constructs into more efficient forms. Efficiency in this sense relates to the execution time of a program, its size, and the amount of power needed to execute it on a given CPU. Most modern compilers include an optimisation phase which attempts to increase the efficiency of a program; a user may specify the optimisation level at which the compiler optimises code. The optimisation phase of a compiler takes place after source files have been converted into an intermediate representation. This intermediate form captures all of the syntactic and semantic information gained from the source files and is eventually used to build graphs on which analysis can be performed and hence transformations can be applied. Figure 1.1 shows this process. A compiler's underlying purpose is to generate correct assembly code and nothing more.

Other external tools are used to create a final binary which a user can run: an assembler is needed to transform the assembly code instructions into machine code instructions which are stored in object files (.o files). A linker is also needed to combine the generated object files and to fix any undefined references to code or data not within the current object file. For example, there might be a call to the printf function in the generated assembly code. The machine code instruction has no idea where the printf routine is located, so there needs to be some method to locate the routine - when the program is linked

the linker will search its library paths for the printf routine and will replace all references to printf with a physical address with which instructions can reference this function. Usually this process is transparent to the user; for example the GNU C compiler will perform all of these operations on the user's behalf unless they specify otherwise.

Figure 1.1: Compilation Process

In this dissertation I describe optimising a program after it has been compiled, assembled and linked. It turns out that after link time, or during link time (note: from now on these terms will be used interchangeably; they mean the same thing in the context of this dissertation), we can perform even more optimisations, even if the compiler was able to perform aggressive program optimisations. This is generally due to the information provided to us by the linker. This information allows us to, for example, calculate indirect jump targets, which can be used to improve program execution time by converting costly indirect jumps into PC-relative jumps. An optimisation such as this cannot be performed by the compiler, since the compiler might not know the target of the jump - it could be in a far-off library such as the standard C library. It is the job of the linker, and not the compiler, to determine this kind of information. Possibly the biggest advantage of performing optimisations after a program has been linked is the global view of the program. After linking we have access to all object files; these object files contain the external functions which programmers use in everyday programs. Functions such as malloc, printf, qsort etc. can all be optimised for specific calling contexts. Compilers simply cannot optimise these external library functions because they do not see their high-level source code - they see nothing but the source which a user has supplied!
The majority of this dissertation is devoted to the development of a port of an existing post-link-time optimiser to another architecture, the Alpha. The remainder of the dissertation focuses on the optimisations which are available to us at link time; I discuss the optimisations which I have implemented in the port as well as further optimisation-related work which will eventually make it into the port. A link-time optimisation tool such as the one which I have ported works by reading a binary along with all object files which make up a program. With these

files and a linker map file we can emulate the operation of a linker; essentially we re-link the object files and binary using the linker map file. Doing so gives us a global view of the program with which we can perform additional program optimisations. The operation of an optimiser which works post link is described in figure 1.2. This diagram essentially shows the portions of the port which need to be implemented. Thankfully the portion of the port which reads the inputs to Diablo was already implemented, therefore it was my task to implement the disassembly, assembly, control flow graph creation, code layout, and basic optimisation phases of the Alpha port of Diablo. A control flow graph is a representation of a binary program in which nodes are blocks of code connected by edges which describe possible control transfers. This data structure is an underlying concept in an optimisation tool: we perform the analysis and optimisation phases on this generated control flow graph.

Figure 1.2: Post link optimisation

1.3 Project Structure

The main goal of the project was to port a backend of an existing post-link-time optimiser to another architecture. On completing this phase of the project I continued working on the code base and extended it to perform some program optimisations which have resulted in good reductions in both binary size and execution time. In the dissertation I have written a considerable amount on the research area that is compiler optimisation, in an attempt to familiarise the reader with many of the concepts in the field and in turn to ease them through the rest of the dissertation. I describe my implementation of the port, which was completed in around 3500 lines of code in the C programming language. Following this I describe the optimisations which I implemented and how these optimisations affected program execution time and program size.
Finally I give a summary of the project and point to further work which might be undertaken on this project.

1.4 Dissertation Structure

I begin this dissertation with a chapter devoted to background information on the project. In this chapter I explain in more depth what exactly link-time optimisation is, why it is useful, and the applications of performing transformations after linking. Following this I give an introduction to the tool which I will be porting, and I then differentiate between statically and dynamically linked binaries. Finally I give an overview of the Alpha architecture, pointing out any interesting features and other notable information about the architecture which will be assumed in later chapters.

Chapter 3 describes related work within the field of compiler optimisation; it also describes optimisation techniques and papers which are specific to post-link-time optimisation. I also devote some time to describing other tools which are used for object-level modification. Chapter 4 is devoted to the implementation of the Alpha port. The chapter begins by introducing the reader to the Diablo framework; this is followed by implementation details of the actual port itself. Chapter 5 describes the various optimisations which I decided to implement having completed work on the backend of the port. Here I mention a few optimisations which relate to code size and one which relates to improving execution time. Chapter 6 reports on the results observed from applying the optimisations on a set of benchmarks. The benefits and pitfalls of each optimisation are looked at in depth. Chapter 7 is the final chapter - I give my conclusions on the success of the project and mention any further work that could be merged into the Alpha port. Appendix A gives a very brief description of the source files which I wrote, primarily to point someone who is interested in taking on this work in the right direction. Appendix B gives guidelines for using the tool and preparing programs for optimisation after linking.

2. Background

In this chapter I discuss information which I think plays an important role in the project. I begin by explaining in more depth just exactly what post-link-time optimisation is and why we should perform it. Following this I introduce the Diablo tool and talk a little about binary files. I finish the chapter by describing the underlying architecture to which Diablo will be ported.

2.1 Why Link Time Optimisation?

Typically code optimisation falls within the realm of the compiler, since the compiler has access to much more semantic and syntactic information such as data types and data structures. Inter-procedural optimisations are possible at compile time, however they are restricted to the procedures within a single source code file. For example, external libraries such as the standard C library which are linked to a program cannot be optimised within this specific program's context. Generally aggressive procedural and inter-procedural optimisations will yield good results with regards to speed and code size, however there are many opportunities for even better results if we think about performing transformations after link time. At link time we have access to all of the object files which make up a program, so we are able to perform classical optimisations at the basic block, instruction, and whole-program (external libraries and all) levels. Hence we can expect to see gains in many areas of program efficiency such as speed, binary size and power consumption. Link-time optimisation sounds like a perfect solution to getting the most out of a binary, however there are some drawbacks to using this approach to perform code optimisations:

- Optimisation at the object level is non-trivial. It can be difficult to extract any real meaning from an object file (which in essence is really just a series of machine code instructions), whereas with a source file it is easier to analyse the semantics of a piece of code.
Having semantic information about a program makes it easier to perform intelligent optimisations.

- At source level strange code constructs are rare (programmers usually do not write strange code), so analysis at the source level is generally easier. At machine code level, the compiler may generate strange code, and hence strange code sequences are much more frequent. This makes our task that little bit harder.

- Generally the generated binaries are larger than the source files; couple this

with the difficulty of object optimisation and the problem becomes harder again.

Of course there are advantages to link-time optimisation as opposed to compile-time optimisation:

- We can be language independent! Object file formats obviously do not differ from language to language, so we can effectively apply our optimisations to any source language.

- We can perform optimisations which are not available at compile time, such as jump conversions. Here the compiler often cannot determine the target of a jump, hence the most conservative jump instruction must be chosen. On the Alpha this will usually translate to either a JMP or JSR (jump subroutine) instruction, which are costly because they need to read from memory before they perform their jump. If we know the target (which the compiler won't) we can replace these instructions with other more efficient instructions.

- We can perform whole-program optimisations. That is, we can optimise functions and instructions that we did not write, i.e. those in external libraries. Optimising at link time is the only way to optimise an entire program.

In this dissertation I only consider the application of optimisation after link time, however many other applications of post-link binary rewriting exist and may be useful in certain circumstances:

Translation to a different architecture - Since we have all the information that a program has to offer, complete binary translation to another architecture can be performed. DEC (now Compaq) successfully created a binary translator to run x86 binaries on DEC's Alpha systems [5]; Intel and the Transmeta corporation have also created similar translation systems.

Customisation - Say a processor developer updated their instruction set to include some new instructions. Usually you would have to wait for the compiler to add support for these instructions, which could take a long time.
By using a link-time translation program, we could add support for the new instructions quickly. Another example might be replacing instructions that have been removed from an instruction set with instructions that are valid.

Instrumentation - A common reason for instrumenting a binary is to obtain profile information, for example finding the hottest paths through a program for common cases. Information such as this can be used to create a profile which we can use to optimise specific parts of the

program. The FIT (Flexible Instrumentation Toolkit) [3] tool developed by the PARIS group at the University of Ghent was created to do just this. This tool is a member of the Diablo family, however sadly a port to Alpha does not exist at this time. Another application of instrumentation is checking the safety of a program. For example, the usual but slightly dated method for exploiting a buffer overflow is to change the return address of a function to point to a piece of code such as a setuid shell. It would be possible post link to check that all return addresses lie in a valid area of memory.

Code Obfuscation - Code obfuscation is a technique which makes binary programs harder to understand and hence harder to reverse engineer. Performing obfuscation at link time yields much better results than at compile time since, as before, external functions can also be obfuscated. Calls to functions in a static context can be disguised as calls to other functions. Loco [17] is a code obfuscation tool developed by the PARIS group at the University of Ghent, Belgium.

2.2 Diablo

Diablo (Diablo Is A Better Link-Time Optimiser) is a retargetable binary rewriting framework developed by the PARIS research group at the University of Ghent in Belgium. Currently ports exist for MIPS, x86, ARM, and IA64, and currently only statically linked programs are supported for optimisation. Diablo was designed to fix many of the problems present in other link-time binary rewriters, such as lack of support for multiple architectures, correctness, safety, and extensibility. Diablo supports multiple architectures by providing a generic interface with which one can describe a new architecture; many architecture-independent utility functions exist and help to make the job of a person writing a port a little easier. Diablo is guaranteed to be safe because of the information that is available to it at link time.
Information about relocations is available and hence we can guarantee that the output binary will be correct. Adding new features to Diablo is also an easy task; generally adding a new feature involves no more than actually writing the feature - there is no need to trawl through the complex internal workings of Diablo. Diablo is conceptually split into two halves, the front end and the back end. The front end is an architecture-independent environment in which generic functions call architecture-dependent functions within an architecture back end in order to output a final translated binary. Diablo takes as its inputs an original binary which has been statically linked with the necessary libraries, the libraries themselves (.o and .a object files), and finally a linker map file. The linker map file provides

information to Diablo which is necessary to emulate a linked program. The map file contains references to the library functions which the original binary calls; with this information Diablo is able to create an emulated statically linked binary. This dissertation describes the implementation of a back end of Diablo for the Alpha architecture using Digital UNIX ECOFF binaries and the Tru64 UNIX operating system. I also implement some basic optimisations in an attempt to improve program execution behaviour.

2.3 Statically and Dynamically Linked Binaries

When a program is linked statically, all references within the program are resolved within the final program binary. This means that all a user needs to run the program is the single binary file; the binary contains everything it needs to know about the program. On the other hand, dynamically linked programs contain references to external libraries. These references need to be resolved at run time on every execution of the program, hence performance usually takes a hit when compared to statically linked programs. With statically linked programs we have available to us all the information relating to the program and its libraries, hence we have a good opportunity to perform global optimisations. Statically linked binaries however are much larger, since we are cramming all of the information from other libraries into one final binary. Although this sounds bad, it is cancelled out by the fact that we need not have a copy of the external libraries to run the program. A program could easily be optimised on a machine where disk space requirements are not an issue; the optimised binary could then be placed on an embedded device with little hard disk space, where it would run with no problems. This is all well and good if there are only a few programs which need to reside on the resource-constrained device. If the device needs access to a large range of programs then statically linked binaries are not the way to go.
In [15] the main problems associated with linking programs dynamically are described and solutions to these problems are posed.

2.4 The Alpha Architecture

The Alpha [7] [12] is a true 64-bit superscalar RISC architecture which has been designed with performance as its main focus. There are 64 registers available to use, all of which are 64 bits wide. Registers $0 through $31 are integer registers; registers $32 through $63 are floating point registers. Registers $31 and $63 are

hard-coded to the values 0 and 0.0 respectively. Some of these registers by convention have standard uses:

- Register $29 ($gp) is used to index the global address table (GAT), which is a table of 64-bit constants. Each function maintains its own GAT, hence re-computation of the $gp (global pointer) register is performed on entry to and exit from every function. This property of functions is something that will be looked at in detail later in the dissertation.

- Registers $0, $32, $33 are generally used to store the return values of functions.

- Registers $16... $21 and $48... $53 are used to store function parameters. Functions with more than 6 parameters should use the stack to pass their parameters. This would be useful information when attempting to implement function inlining optimisations.

- Register $26 ($ra) stores the return address of a function, i.e. this register is saved every time a function is entered so the program can continue execution from the correct location once the function has returned.

- Register $30 ($sp) is the stack pointer.

- Register $27 ($pv) is a pointer to the currently executing procedure.

Endianness on Alpha systems can be either little or big, however for this project I am assuming that switched endianness has not been enabled, so all operations use the little-endian byte addressing mode. Enabling support for switched endianness is trivial within Diablo and is a possible candidate for further work. The instruction set is a fixed-size instruction set with all instructions being encoded as 32-bit unsigned integers; instructions are grouped into 5 high-level formats:

- PALcode format.
- Branch format.
- Operate format.
- Floating point operate format.
- Memory format.

Slight variations on these formats exist for instructions such as unconditional branches and operate instructions with an immediate operand. Instruction formats will be mentioned more in the design and implementation of the disassembler and assembler sections of this paper.
An instruction format describes which bits of the instruction word are used. For example, all instruction formats use the 6 most significant bits to store the instruction opcode.

Alpha PALcode

Both software developers and operating systems developers require consistent functions across different implementations of the Alpha architecture. However, there may be subtle differences in hardware implementation which a developer might like to take advantage of - for example context switches, memory management functions, power management, and emulation of instructions with no hardware support. PALcode is used to implement such functions using standard Alpha instructions which reside in main memory. Other architectures may support such functions in hardware, whereas the designers of the Alpha architecture chose to allow modification of such functions via the PALcode interface. PALcode routines can be thought of as software interrupts (SWI) since they require complete control of the machine, with interrupts disabled. During the port of Diablo we need not worry too much about PALcode libraries; we should just assume that they alter control flow like a regular branch instruction. This is important when it comes to creating the control flow graph.

3. Related Work

Research in the field of compiler optimisation has been ongoing for a very long time, hence there are many related areas of work. In this chapter I will attempt to give a quick round-up of related work in the field, and more specifically work related to optimisations and tools which are used post link time.

3.1 Classical Optimisations

Classical optimisations are the optimisations which everyone has heard of; they form the basis of today's optimising compilers. It is important to mention them in the context of optimising a program post link time since many of the compile-time classical optimisations can be re-applied after linking. The linking process is very beneficial to us: it reveals a lot of interesting information which in turn gives us the opportunity to perform classical optimisations again. Typically a compiler will perform a couple of rounds of classical optimisations; after link time we can do yet more rounds of classical optimisations to increase program performance.

Optimisation Techniques and Trade-offs

In this section we discuss various traditional compile-time optimisations along with their link-time variants and why these optimisations are viable at or after linking. It is also necessary to introduce these terms in a simple form at this stage in the dissertation since they will be mentioned heavily throughout the remaining sections.

Register Allocation

Register allocation is an optimisation which attempts to increase the number of variables which are stored in registers, since loading and storing values from and to memory just before they are needed is much more costly than direct register-to-register operations. Performing register allocation in turn gives us more opportunities to perform other optimisations such as peephole optimisations and strength reduction (the process of replacing costly instructions with cheaper ones).
There are two main approaches to register allocation: the older method is based on graph colouring [24], the newer method is named

linear scan [23]. Graph colouring will produce better register allocation schemes, however it takes a much longer time to determine a scheme. This is not a problem for the binaries which we plan to optimise with the port of the optimiser which I have written; however, in the case of dynamic compilation strategies such as JIT compilation, the speed of register allocation is important [27]. Linear scan offers much better speed when producing register allocations, however the quality of the allocation may suffer, with more variables left out of registers for longer. What I have described thus far suffices for register allocation at compile time. It turns out we can perform register allocation again after the linking process! The idea is the same except we can perform register allocation on a global scale. This means liveness information can be propagated between module boundaries, hence a more accurate register allocation scheme can be generated. The term liveness was mentioned briefly but was not explained; I will attempt to explain this idea as it is central to register allocation strategies. A variable or register is said to be live at a specific program execution point if it holds a value which might be used at a later point in its execution path [2]. At its most basic, register allocation is the task of assigning live variables to CPU registers where possible. Liveness analysis needs a control flow graph to determine live variables within a program; during my implementation we create a control flow graph with which an improved register allocation scheme could be implemented.

Peephole Optimisations

This class of optimisation is used on small blocks of nearby instructions. In his paper [19] McKeeman first describes peephole optimisations and gives a number of examples. They are so called because examining nearby instructions in a block of instructions is said to be like peeping through a hole and peering around.
Some examples: the statements

    X := Y;
    Z := X + Z

will generate assembly code something similar to:

    LDA y, (Y)
    STA y, (X)
    LDA x, (X)   ; Useless, eliminate.
    ADD Z, x, Z

Here the 3rd instruction is not required, since the value of X is already available in register y; the add instruction would then be changed to ADD Z, y, Z. Performing these optimisations after linking as well as during

compilation should yield performance gains as well as code size reductions. However, removing instructions introduces a problem with regards to jump targets and basic block positioning. These problems can be resolved during the code layout phase, when writing the final binary after optimisations have been performed.

Profile Guided Optimisations

Profile-guided optimisation (PGO) can be used to tune the performance of a given program by detecting bottlenecks in a program's execution. A compiler which supports profile-guided optimisations, such as Intel's C/C++ compiler, needs to run twice over the source program. The first run of the compiler instruments the program to reveal information such as instruction execution counts or basic block execution counts. The binary must then be run with the instrumentation code added; the information obtained is then fed back into the compiler along with the original program. The compiler can then make more informed decisions about where it optimises the program and which optimisations it applies. Typically programs which contain many branches, for example checking the return value of a function call against a wide range of values, will benefit from PGO. Programs which are generally fixed in their operation will perform better with PGO when compared to programs which have a much larger possible range of outcomes. Post-link-time PGO extends compile-time PGO: we can instrument the actual library functions and generate profile information using instruction and basic block reference counts. Imagine a library function which performs a lot of I/O; this function might open files, create sockets etc. A library function like this might greatly benefit from PGO since usually when a programmer writes an I/O function they then check for correct return values - they might even check the value which errno was set to. If the program which calls this function generally does not deviate too much, i.e.
the I/O functions which are called usually perform the same operations and return the same values, then PGO will work excellently.

Unreachable Code Elimination

Unreachable code is code to which there is no path of execution. We will see in later chapters of this dissertation (control flow graph creation) how control transfers are represented in a control flow graph. In the example below, line 8 is clearly unreachable. Constructs like this are detected after the control flow graph has been created; this is one of the easier optimisations to perform, and we will find that the optimiser which I am porting already performs this task for us.

1 int func()

2 {
3     int x = 10, y = 5, z = 20;
4
5     x *= 2;
6     y = x + z;
7     return (y * y);
8     y *= z;
9 }

Line 8 is clearly unreachable, therefore it makes sense to remove this instruction.

Branch Optimisations

Branches are usually the biggest cost inducers in a program, since they need to access and set a couple of registers, or even memory in some cases. In some cases we can replace branch instructions with non-branch instructions such as conditional moves, and here we should see a significant performance gain. Branches which need to access memory are called indirect jumps and are the slowest kind of jump on the Alpha: they need to access the GAT, which is a costly operation. It turns out that we can modify instructions of this type if we can determine the target of the jump. The modification involves changing the instruction to a less costly PC-relative jump such as a BR or a BSR. We will see later in this dissertation that an optimisation like this performs very well. This is because indirect jumps are sometimes the instructions used to call a procedure, and since there will be many procedure calls within a program's lifetime we can greatly reduce execution time by making these modifications.

3.2 Existing Tools

There has been a lot of work in the field of post link time optimisation in the last decade or so. The topic of link time optimisation was first mentioned in a paper [28] written by Wall at DEC's lab in Palo Alto, California. In his paper Wall describes an intermediate language named Mahler for their experimental 32-bit machine (Titan). Mahler was designed as a compromise between traditional assembly code and a compiler's intermediate representation, i.e. the Mahler language was used to represent all source languages. The Mahler system comprised a Mahler assembler which would take a Mahler source file (either hand written or compiler generated) and produce object files containing disassembled Mahler instructions.
The assembler in this case, unlike modern day assemblers, generated heavily annotated code. It would record a list of the variables defined

and procedures called within a module; relocation information was also stored. Storing relocation information at this stage meant they were able to employ a smart register allocation scheme at link time. This scheme was used to determine whether certain load and store instructions could be promoted to registers rather than kept in memory, giving a certain performance boost. The most interesting portion of the Titan/Mahler system was the optimising linker, which was able to perform interesting low level and high level optimisations such as register allocation and instruction scheduling. The information retained by the Mahler assembler was very useful: source level constructs were annotated, and therefore the linker could perform surprisingly sophisticated optimisation at what is a very late stage in the process of generating a final binary. Most subsequent work in the area of link time and post link time optimisation was based on the ideas presented in this paper.

OM

Tools such as OM [25], another link time optimising system for DEC's Alpha architecture, sprang up after Wall's paper. OM was developed by the same researchers at DEC's lab in Palo Alto, and is much more practical than the mostly theoretical approach of the Mahler system described in [28]. OM's approach is close to the approach which a modern day link time optimiser such as Diablo takes. OM reads all the libraries and binaries which constitute a program, and the code within the program is converted into a Register Transfer Language (RTL). The RTL is used to represent the code that must be generated; it is essentially a bridge between a high level language and assembly language. RTLs are usually specified in LISP style syntax and are usually implementation independent, that is, one could look at an RTL description and not know the target architecture's instruction set.
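As a toy illustration (the operator and register names below are invented, not OM's actual RTL vocabulary), a register transfer such as r1 := r2 + 8 might be written as a nested LISP-style expression which a back end then lowers to target instructions:

```python
# A LISP-style RTL expression for "r1 := r2 + 8", written as nested
# Python tuples. The operators here are made up for illustration.
rtl = ("set", "r1", ("plus", "r2", ("const", 8)))

def lower(expr):
    """Lower one 'set' RTL expression to a list of pseudo-instructions.
    Handles only the tiny register-plus-constant shape used above."""
    _, dest, (op, lhs, rhs) = expr
    assert op == "plus" and rhs[0] == "const"
    # An Alpha-flavoured add of a register and a literal.
    return [f"ADDQ {lhs}, #{rhs[1]}, {dest}"]

print(lower(rtl))  # → ['ADDQ r2, #8, r1']
```

The same expression could be lowered differently for each target, which is exactly the point of an implementation-independent RTL.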
The RTL is then converted into a control flow graph: nodes in the graph are basic blocks, and edges are possible paths of execution. Once a control flow graph has been created OM performs global code optimisations; it can perform these global optimisations since the entire program is represented by the control flow graph. In [21] they describe solutions to interprocedural liveness analysis using an interprocedural control flow graph, and OM used this technique along with several others to perform liveness analysis of registers during the linking phase of building a program. The developers of OM wanted a lightweight optimiser which performed post link time optimisations, and to meet this goal they sacrificed the number of optimisations performed on a program. Many other optimisations which could be performed after linking were skipped; OM does however implement the following [20]: register allocation, instruction scheduling, dead code elimination, and peephole optimisations. OM also supports profile guided optimisation after linking. Overall OM on average saw around 5% gains in program

efficiency, sometimes seeing gains as good as 12% [25].

ALTO

After the eventual stagnation of the OM tool, another tool named Alto [20] appeared as a result of Robert Muth's PhD dissertation at the University of Arizona. Alto strove to be a much more complete link time optimisation tool, and succeeded in bettering OM's results quite considerably. Alto included the following optimisations within its framework: code inlining, instruction scheduling, interprocedural liveness analysis, peephole optimisations, instruction and block elimination, instruction normalisation, constant propagation, and code compression. With this wide range of optimisations Alto saw excellent results in terms of program execution time and code size: it achieves code size reductions of up to 38% and execution time gains of up to 6% on average [20]. However Alto is not very portable; it only works on statically linked Tru64 UNIX ECOFF binaries. As far as I am aware a port to GNU/Linux and ELF binaries was planned but never released to the public. Alto also suffered from a large memory overhead when it ran, since all data in Alto was allocated statically; there is no dynamic memory allocation using standard C functions such as malloc(3) and free(3).

Squeeze, Squeeze++ & Diablo

Squeeze and Squeeze++ [11] [9] followed from Alto and focused mainly on code compression, an issue which was becoming more and more important at this time due to the proliferation of embedded devices on the market. Finally, Diablo seeks to take the best parts of all the tools already mentioned whilst incorporating new ideas. Many of the concepts in the other tools are very outdated; Diablo brings new ideas together as well as reusing the techniques which were successful in the older link time optimisation tools. Diablo shows code size reductions of approximately 5% to 15% on the ARM platform.
The benchmarks used to produce these results were compiled with an incredibly aggressive, industry grade optimising developer suite released by ARM (the ARM developer suite); this needs to be taken into consideration when assessing the results Diablo has shown on the ARM platform.

Diota

Diota [18] (Dynamic Instrumentation, Optimisation and Transformations of Applications) is a tool which works with dynamically linked programs, in contrast to all of the previous tools, which work only with statically linked programs. Instrumentation has been briefly covered in the PGO section of this dissertation, therefore I will only briefly describe the instrumentation which Diota performs. Diota can insert instructions at specific places in a program in order to reveal information which is otherwise unavailable until the program is actually executed. Using this technique hot code can be discovered and hence optimised more aggressively, or more specifically with regard to the code's context. Instrumentation in Diota can be used for more than just this: for example, temporarily fixing errors without recompilation, or adding support for newer instructions offered by the target native instruction set. Gathering information about a program's execution is referred to as passive instrumentation, whereas actually modifying the final binary program is known as active instrumentation. Diota is able to monitor all memory accesses and is hence able to detect race conditions; race conditions in programs such as Mozilla and Konqueror have been detected whilst running under the control of Diota.

3.3 Link Time Optimisation Techniques

The tools mentioned in the previous section show good results for code compaction, and similarly good results for execution time. How exactly does an optimiser reduce the size of a binary, in some cases by close to 50%, while also reducing execution time? In this section of the dissertation I introduce some of the related work in the field of post link optimisation which has led to such impressive results.

Specific Code Optimisation Work

Most C/C++ development tool chains only perform optimisations at compile time.
As we have previously mentioned, the compiler is limited to individual source modules when performing optimisations. Modules can be seen as boundaries between different parts of a program, and in turn can also be seen as optimisation boundaries [9]. Typically linkers perform little or no optimisation; the Intel C/C++ tool chain is, however, an exception to this rule. Usually a linker will only resolve dependencies in programs by searching libraries for external functions, ignoring any opportunities for global optimisation. Another less conspicuous problem is the advocacy of code reuse in the software engineering community. Students are also told to write code to handle every

situation, and programs may have functions which were implemented in case a new situation arose. Often these cases and functions are never used, hence binaries become bloated with references to useless objects [9]. The compiler cannot remove these cases because it does not know the contexts in which the program is called; calling contexts are visible to us after linking, since we have a global view of the program. We know which objects are useless if we perform control flow analysis, hence we can eliminate them. The other ports of Diablo currently employ most modern code size reduction methods. Post link time value analysis [9] statically determines whether a register's value, independent of the program's input, is either constant or can only take a value from some restricted set. Value analysis is used to remove portions of code which we can determine will never be executed, or to remove portions of code whose result is already known. A good example of value analysis is constant propagation [4]. Compilers are able to perform only limited constant propagation, since they do not know anything about the data and code addresses which are determined at link time. Constant propagation algorithms work in the same way at compile time and post link time; at link time we just have a much wider view of the program and hence much more access to information regarding constants. We can also apply post link time register re-allocation schemes. De Sutter et al [10] make improvements to backwards liveness analysis and introduce forward liveness analysis. Combined forward and backwards liveness analysis has produced excellent results on the Alpha architecture, finding 62% more than previous attempts did; this translates to the amazing conclusion that on average only half of the Alpha registers are in use. Since compilers compile source modules separately, they must adhere to calling conventions.
Calling conventions are defined in an application binary interface (ABI) and specify, for example, which registers must be saved and when, which registers are used to pass parameters to functions, and which registers are used for return values. The method described in [10] works by bending these calling conventions. This is safe to do since at link time we can see the whole program at once; code does not jump unpredictably from module to module, so it is safe to relax calling conventions as long as correctness is not sacrificed. Given calling convention information, the advanced interprocedural liveness analysis can then be performed. In [10] they improve on the state of the art liveness analysis described by Robert Muth in [20]; Muth's technique was used in Alto, Squeeze and Squeeze++ [11] [9] with good results.
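The core of any such register liveness analysis is a backwards dataflow computation over the control flow graph. The sketch below is a minimal single-procedure rendering of that idea (the block names, uses and definitions are invented, and the interprocedural and calling-convention refinements from [10] are omitted):

```python
# Backwards liveness analysis over a toy CFG, iterated to a fixed point.
# live_out[b] = union of live_in over b's successors;
# live_in[b]  = use[b] | (live_out[b] - def[b]).
def liveness(blocks, succs):
    """blocks: {name: (use_set, def_set)}; succs: {name: [successors]}.
    Returns (live_in, live_out) dictionaries per basic block."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            new_in = use | (out - defs)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out

# r1 is defined in "entry" and used in "exit", so it stays live
# across the loop even though the loop never touches it.
blocks = {
    "entry": (set(),        {"r1"}),   # r1 := ...
    "loop":  ({"r2"},       {"r2"}),   # r2 := f(r2)
    "exit":  ({"r1", "r2"}, set()),    # return g(r1, r2)
}
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succs)
print(sorted(live_out["entry"]))  # → ['r1', 'r2']
```

Any register absent from every live set at a program point is free there, which is exactly the information a link time register re-allocator exploits.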

Smart Code Layout

Code layout refers to the process of converting the nodes (basic blocks) of a control flow graph into a straight line program which can be executed on the target machine [2]. The general strategy is to place together blocks through which execution passes straight, that is, blocks which do not end in jumps. Jumps are costly instructions which degrade I-cache (instruction cache) performance [12]. I-cache performance is greatest when instructions which execute close together in time are also laid out close together in memory. This is the principle of cache locality, which holds since instructions located close together are likely to refer to the same registers and memory locations. Therefore we can improve a program's execution time by laying code out in a way which is complementary to cache locality. In [22] they describe an algorithm which uses profiling data to lay out basic blocks efficiently. Within most programs there are paths of execution which are frequently taken and paths which are rarely taken. Traditional compilers will often place such blocks together, and in doing so diminish the effect of the caching system. What Pettis et al describe is a method of eliminating these cases by moving the infrequently executed paths out of the way of the frequently executed paths, i.e. increasing cache performance. Their method also extends the expected time before a branch is taken, which again improves I-cache hit rates. Their method uses a weighted digraph: each node in the graph is a basic block, and each edge represents a control flow path between two basic blocks, i.e. a jump from one to another or a fall through (straight line) edge. Each edge is weighted using the data gained during the profiling run; the weight between two nodes represents the number of times that path was taken.
Their algorithm starts by finding the edge of largest weight in the graph; the nodes connected by this edge are chained together. In the graph, the nodes chosen for chaining are merged, and the edges which led from them are coalesced. This process continues until the graph contains one or more nodes with no edges. This method results in excellent I-cache usage improvement: benchmarks tested on the Hewlett-Packard PA-RISC architecture showed a 98.2% improvement in I-cache performance. The method described above and in [22] performs very well; however, the algorithm is rather old and there have been further improvements in the area of code layout. One new technique, which is forming a new area of research in compiler optimisation, is the use of artificial intelligence. In [16] a novel approach to finding optimal code layouts using neural networks has been developed. The approach in the paper is architecture dependent, and profiling information is also required for a proper implementation of the algorithm. The profiling information is used to determine features of the neural network, information such as instruction type and the direction of the branch (either forward: positive displace-
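The greedy chaining step just described can be sketched as follows. This is a simplified, intraprocedural rendering of the Pettis and Hansen algorithm; the block names and edge weights are invented for illustration:

```python
# Greedy basic-block chaining driven by profile edge weights, in the
# spirit of Pettis and Hansen's code layout algorithm.
def chain_blocks(edges):
    """edges: {(src, dst): weight}. Greedily merge blocks into chains,
    taking edges from heaviest to lightest; an edge joins two chains only
    when src ends one chain and dst begins a different one."""
    chains = {b: [b] for e in edges for b in e}   # every block starts alone
    for (src, dst), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        c_src, c_dst = chains[src], chains[dst]
        if c_src is not c_dst and c_src[-1] == src and c_dst[0] == dst:
            merged = c_src + c_dst
            for b in merged:          # every member now points at the merge
                chains[b] = merged
    seen, layout = [], []
    for c in chains.values():         # emit each distinct chain once
        if not any(c is s for s in seen):
            seen.append(c)
            layout.append(c)
    return layout

# A hot loop path A->B->D (weight 90) dominates a cold error path
# A->C->D (weight 10); the hot path ends up laid out contiguously.
edges = {("A", "B"): 90, ("B", "D"): 90, ("A", "C"): 10, ("C", "D"): 10}
print(chain_blocks(edges))  # → [['A', 'B', 'D'], ['C']]
```

The frequently executed blocks A, B and D fall through to one another in the final layout, while the rarely executed block C is pushed out of the hot path, which is precisely the cache-locality effect described above.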


More information

Slide Set 1 (corrected)

Slide Set 1 (corrected) Slide Set 1 (corrected) for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary January 2018 ENCM 369 Winter 2018

More information

Page 1. Structure of von Nuemann machine. Instruction Set - the type of Instructions

Page 1. Structure of von Nuemann machine. Instruction Set - the type of Instructions Structure of von Nuemann machine Arithmetic and Logic Unit Input Output Equipment Main Memory Program Control Unit 1 1 Instruction Set - the type of Instructions Arithmetic + Logical (ADD, SUB, MULT, DIV,

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

Lecture 4: MIPS Instruction Set

Lecture 4: MIPS Instruction Set Lecture 4: MIPS Instruction Set No class on Tuesday Today s topic: MIPS instructions Code examples 1 Instruction Set Understanding the language of the hardware is key to understanding the hardware/software

More information

Program Optimization

Program Optimization Program Optimization Professor Jennifer Rexford http://www.cs.princeton.edu/~jrex 1 Goals of Today s Class Improving program performance o When and what to optimize o Better algorithms & data structures

More information

Chapter 3:: Names, Scopes, and Bindings (cont.)

Chapter 3:: Names, Scopes, and Bindings (cont.) Chapter 3:: Names, Scopes, and Bindings (cont.) Programming Language Pragmatics Michael L. Scott Review What is a regular expression? What is a context-free grammar? What is BNF? What is a derivation?

More information

COS 140: Foundations of Computer Science

COS 140: Foundations of Computer Science COS 140: Foundations of Computer Science CPU Organization and Assembly Language Fall 2018 CPU 3 Components of the CPU..................................................... 4 Registers................................................................

More information

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis Synthesis of output program (back-end) Intermediate Code Generation Optimization Before and after generating machine

More information

Reversing. Time to get with the program

Reversing. Time to get with the program Reversing Time to get with the program This guide is a brief introduction to C, Assembly Language, and Python that will be helpful for solving Reversing challenges. Writing a C Program C is one of the

More information

Lecture Outline. Intermediate code Intermediate Code & Local Optimizations. Local optimizations. Lecture 14. Next time: global optimizations

Lecture Outline. Intermediate code Intermediate Code & Local Optimizations. Local optimizations. Lecture 14. Next time: global optimizations Lecture Outline Intermediate code Intermediate Code & Local Optimizations Lecture 14 Local optimizations Next time: global optimizations Prof. Aiken CS 143 Lecture 14 1 Prof. Aiken CS 143 Lecture 14 2

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA)... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data

More information

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming

More information

(Refer Slide Time: 01:25)

(Refer Slide Time: 01:25) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture - 32 Memory Hierarchy: Virtual Memory (contd.) We have discussed virtual

More information

Overview. Introduction to the MIPS ISA. MIPS ISA Overview. Overview (2)

Overview. Introduction to the MIPS ISA. MIPS ISA Overview. Overview (2) Introduction to the MIPS ISA Overview Remember that the machine only understands very basic instructions (machine instructions) It is the compiler s job to translate your high-level (e.g. C program) into

More information

Comp 11 Lectures. Mike Shah. June 26, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures June 26, / 57

Comp 11 Lectures. Mike Shah. June 26, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures June 26, / 57 Comp 11 Lectures Mike Shah Tufts University June 26, 2017 Mike Shah (Tufts University) Comp 11 Lectures June 26, 2017 1 / 57 Please do not distribute or host these slides without prior permission. Mike

More information

T Jarkko Turkulainen, F-Secure Corporation

T Jarkko Turkulainen, F-Secure Corporation T-110.6220 2010 Emulators and disassemblers Jarkko Turkulainen, F-Secure Corporation Agenda Disassemblers What is disassembly? What makes up an instruction? How disassemblers work Use of disassembly In

More information

Personalised Learning Checklist ( ) SOUND

Personalised Learning Checklist ( ) SOUND Personalised Learning Checklist (2015-2016) Subject: Computing Level: A2 Name: Outlined below are the topics you have studied for this course. Inside each topic area you will find a breakdown of the topic

More information

Programmer Directed GC for C++ Michael Spertus N2286= April 16, 2007

Programmer Directed GC for C++ Michael Spertus N2286= April 16, 2007 Programmer Directed GC for C++ Michael Spertus N2286=07-0146 April 16, 2007 Garbage Collection Automatically deallocates memory of objects that are no longer in use. For many popular languages, garbage

More information

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace

Project. there are a couple of 3 person teams. a new drop with new type checking is coming. regroup or see me or forever hold your peace Project there are a couple of 3 person teams regroup or see me or forever hold your peace a new drop with new type checking is coming using it is optional 1 Compiler Architecture source code Now we jump

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Why Study Assembly Language?

Why Study Assembly Language? Why Study Assembly Language? This depends on the decade in which you studied assembly language. 1940 s You cannot study assembly language. It does not exist yet. 1950 s You study assembly language because,

More information

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore (Refer Slide Time: 00:27) Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Lecture - 1 An Overview of a Compiler Welcome

More information

Programming at different levels

Programming at different levels CS2214 COMPUTER ARCHITECTURE & ORGANIZATION SPRING 2014 EMY MNEMONIC MACHINE LANGUAGE PROGRAMMING EXAMPLES Programming at different levels CS1114 Mathematical Problem : a = b + c CS2214 CS2214 The C-like

More information

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Author: Dillon Tellier Advisor: Dr. Christopher Lupo Date: June 2014 1 INTRODUCTION Simulations have long been a part of the engineering

More information

CE221 Programming in C++ Part 1 Introduction

CE221 Programming in C++ Part 1 Introduction CE221 Programming in C++ Part 1 Introduction 06/10/2017 CE221 Part 1 1 Module Schedule There are two lectures (Monday 13.00-13.50 and Tuesday 11.00-11.50) each week in the autumn term, and a 2-hour lab

More information

Chapter 3:: Names, Scopes, and Bindings (cont.)

Chapter 3:: Names, Scopes, and Bindings (cont.) Chapter 3:: Names, Scopes, and Bindings (cont.) Programming Language Pragmatics Michael L. Scott Review What is a regular expression? What is a context-free grammar? What is BNF? What is a derivation?

More information

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI

CSCI 402: Computer Architectures. Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI CSCI 402: Computer Architectures Instructions: Language of the Computer (4) Fengguang Song Department of Computer & Information Science IUPUI op Instruction address 6 bits 26 bits Jump Addressing J-type

More information

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358 Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement

More information

A Feasibility Study for Methods of Effective Memoization Optimization

A Feasibility Study for Methods of Effective Memoization Optimization A Feasibility Study for Methods of Effective Memoization Optimization Daniel Mock October 2018 Abstract Traditionally, memoization is a compiler optimization that is applied to regions of code with few

More information

Lecture Notes on Liveness Analysis

Lecture Notes on Liveness Analysis Lecture Notes on Liveness Analysis 15-411: Compiler Design Frank Pfenning André Platzer Lecture 4 1 Introduction We will see different kinds of program analyses in the course, most of them for the purpose

More information

IRIX is moving in the n32 direction, and n32 is now the default, but the toolchain still supports o32. When we started supporting native mode o32 was

IRIX is moving in the n32 direction, and n32 is now the default, but the toolchain still supports o32. When we started supporting native mode o32 was Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.035, Fall 2002 Handout 23 Running Under IRIX Thursday, October 3 IRIX sucks. This handout describes what

More information

Lecture 4: Instruction Set Design/Pipelining

Lecture 4: Instruction Set Design/Pipelining Lecture 4: Instruction Set Design/Pipelining Instruction set design (Sections 2.9-2.12) control instructions instruction encoding Basic pipelining implementation (Section A.1) 1 Control Transfer Instructions

More information

3. Simple Types, Variables, and Constants

3. Simple Types, Variables, and Constants 3. Simple Types, Variables, and Constants This section of the lectures will look at simple containers in which you can storing single values in the programming language C++. You might find it interesting

More information

Other Forms of Intermediate Code. Local Optimizations. Lecture 34

Other Forms of Intermediate Code. Local Optimizations. Lecture 34 Other Forms of Intermediate Code. Local Optimizations Lecture 34 (Adapted from notes by R. Bodik and G. Necula) 4/18/08 Prof. Hilfinger CS 164 Lecture 34 1 Administrative HW #5 is now on-line. Due next

More information

Administrative. Other Forms of Intermediate Code. Local Optimizations. Lecture 34. Code Generation Summary. Why Intermediate Languages?

Administrative. Other Forms of Intermediate Code. Local Optimizations. Lecture 34. Code Generation Summary. Why Intermediate Languages? Administrative Other Forms of Intermediate Code. Local Optimizations HW #5 is now on-line. Due next Friday. If your test grade is not glookupable, please tell us. Please submit test regrading pleas to

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Understand the factors involved in instruction set

Understand the factors involved in instruction set A Closer Look at Instruction Set Architectures Objectives Understand the factors involved in instruction set architecture design. Look at different instruction formats, operand types, and memory access

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Chapter 5. A Closer Look at Instruction Set Architectures

Chapter 5. A Closer Look at Instruction Set Architectures Chapter 5 A Closer Look at Instruction Set Architectures Chapter 5 Objectives Understand the factors involved in instruction set architecture design. Gain familiarity with memory addressing modes. Understand

More information

The X86 Assembly Language Instruction Nop Means

The X86 Assembly Language Instruction Nop Means The X86 Assembly Language Instruction Nop Means As little as 1 CPU cycle is "wasted" to execute a NOP instruction (the exact and other "assembly tricks", as explained also in this thread on Programmers.

More information

Interprocedural Variable Liveness Analysis for Function Signature Recovery

Interprocedural Variable Liveness Analysis for Function Signature Recovery Interprocedural Variable Liveness Analysis for Function Signature Recovery MIGUEL ARAUJO AND AHMED BOUGACHA {maraujo@cs, ahmed.bougacha@sv}.cmu.edu Carnegie Mellon University April 30, 2014 Final Project

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands Stored Program Concept Instructions: Instructions are bits Programs are stored in memory to be read or written just like data Processor Memory memory for data, programs, compilers, editors, etc. Fetch

More information

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Lecture #2 January 30, 2004 The 6502 Architecture

Lecture #2 January 30, 2004 The 6502 Architecture Lecture #2 January 30, 2004 The 6502 Architecture In order to understand the more modern computer architectures, it is helpful to examine an older but quite successful processor architecture, the MOS-6502.

More information

Problem with Scanning an Infix Expression

Problem with Scanning an Infix Expression Operator Notation Consider the infix expression (X Y) + (W U), with parentheses added to make the evaluation order perfectly obvious. This is an arithmetic expression written in standard form, called infix

More information

Memory Management. Memory Management

Memory Management. Memory Management Memory Management Most demanding di aspect of an operating system Cost has dropped. Consequently size of main memory has expanded enormously. Can we say that we have enough still. Swapping in/out. Memory

More information

The Effects on Read Performance from the Addition of a Long Term Read Buffer to YAFFS2. Sam Neubardt

The Effects on Read Performance from the Addition of a Long Term Read Buffer to YAFFS2. Sam Neubardt The Effects on Read Performance from the Addition of a Long Term Read Buffer to YAFFS2 Sam Neubardt My research project examined the effects on read performance from the addition of a long term read buffer

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Field Analysis. Last time Exploit encapsulation to improve memory system performance

Field Analysis. Last time Exploit encapsulation to improve memory system performance Field Analysis Last time Exploit encapsulation to improve memory system performance This time Exploit encapsulation to simplify analysis Two uses of field analysis Escape analysis Object inlining April

More information