ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang

Abstract

Binary translation is an important component of translating virtual machines. ABI virtual machines such as FX!32 [1], SUN Wabi [2], and SHADE [3] use binary translators to translate application binaries whose ISA differs from that of the hardware platform, so that they can execute on that platform. Some systems also use a binary translator as a component of dynamic optimization. In this paper, we study the effectiveness of binary translation between two ISAs with different register file organizations. We chose the MIPS instruction set, which has a flat register file, as the source ISA for its simplicity and generality, and the CRAY-2 instruction set, which has a hierarchical register file, as the target ISA.

1. Introduction

In virtual machine architecture design, emulation is an important technique for making a (sub)system present the same interface and characteristics as another system. By executing and tracing programs, an emulator can help build better computer hardware and software. As one way of implementing an emulator, binary translation is not only efficient for repeatedly executed instructions but also solves the software compatibility problem, which is becoming increasingly serious. For example, one of the major impediments to adopting a VLIW (or any new ILP machine architecture) has been its inability to run existing binaries of established architectures. Code scheduling plays an important role in increasing the ILP available in a program, but aggressive scheduling techniques have high register requirements. In addition, aggressive processor configurations tend to increase the number of registers required by software-pipelined loops.
The flat register file organization traditionally used in microprocessor design does not scale well when the required capacity and the number of access ports are high. The authors of [4] present an alternative register file design for future aggressive VLIW processors: a two-level hierarchical register file, which combines high capacity and a high number of access ports with low access time. Higher capacity reduces spill code and allows the compiler to perform aggressive code scheduling and optimization. Our goal is to develop a binary translator that maps a flat register file configuration to a hierarchical register file configuration, and to characterize the requirements for translating between the two platforms.

This paper is organized as follows. The next section discusses our implementation of the project's major modules. Section 3 presents our experimental setup and results. Section 4 discusses related work, and finally we present our conclusions and future work in Section 5.

2. Methodology

2.1 Overview

Figure 1 shows an overview of the components we developed for this project.

- The compiler that compiles Java or C benchmarks to MIPS assembly code was already available. In this project we use a simple Java compiler that can compile limited Java programs to MIPS assembly code. The language features it supports include integer operations, logic operations, function calls (including recursive calls), variable comparison, string variables, result output (printf), variable assignment, branches (if...else), and while loops. Although it supports only a subset of Java or C, this is sufficient for us because our focus is on binary translation.
- We developed simple in-order simulators for MIPS and CRAY-2 to obtain profile information, as well as statistics for comparing the two platforms. These simulators also serve as correctness verification.
- The parser assembles the MIPS assembly code into binary format and writes the resulting binary to a file.
- The translator translates the input MIPS binary into a CRAY-2 binary file and performs optimizations during translation. The translator is the main emphasis of this project and is discussed in detail below.
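The parser's assembly-to-binary step can be illustrated with a minimal sketch of MIPS R-type encoding. The field layout and FUNCT values below are the standard MIPS32 ones; the function name and the restriction to a few opcodes are ours.

```python
# A minimal sketch of the parser's encoding step for MIPS R-type
# instructions. Standard MIPS32 field layout:
# opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6).

FUNCT = {"add": 0x20, "sub": 0x22, "and": 0x24, "or": 0x25, "slt": 0x2A}

def encode_rtype(op, rd, rs, rt):
    """Encode one R-type instruction into a 4-byte word (opcode 0)."""
    return (0 << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (0 << 6) | FUNCT[op]

# add $3, $1, $2
print(hex(encode_rtype("add", 3, 1, 2)))  # prints 0x221820
```

The translator can then decode the same fixed-width fields back out of each 4-byte word, which is what makes a static MIPS-side scan straightforward.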
Figure 1: Overview of the components in the MIPS to CRAY-2 binary translator

2.2 Instruction sets

MIPS instructions have a fixed length of 4 bytes. MIPS has 34 registers: besides the 32 general-purpose registers, LO and HI are special registers that hold the 64-bit results of multiplication and division instructions. The CRAY-2 instruction set has variable-length instructions and uses word addressing. It has 8 first-level registers and 64 second-level registers. Due to time constraints, we implement only a subset of the MIPS and CRAY-2 instruction sets, and only user-level instructions. Appendix A gives detailed information about our source and target instruction sets.

2.3 Binary translator

Our translator is a static translator with three levels. Level 1 is a baseline translator with no optimization. The level 2 translator enhances the baseline by constructing basic blocks and adding a register allocation technique to reduce spills to the second-level registers. The level 3 translator enhances level 2 by constructing superblocks and applying several superblock optimization techniques. Appendix B gives detailed information on the mapping from MIPS instructions to CRAY-2 instructions.

Level 1 translator. Since the CRAY-2 has more second-level registers than MIPS has registers, all MIPS registers are identity-mapped to CRAY-2 second-level registers. A second-level register is loaded into a first-level register whenever it is used, and the result is written back to a second-level register as soon as it is generated. The MIPS instructions are scanned and converted into CRAY-2 instructions; the resulting CRAY-2 instructions are then scanned again to fix the targets of all control-flow instructions. This scheme lets us translate statically, without any runtime information, but not very efficiently.

Level 2 translator. We construct control flow graphs of basic blocks, each containing several MIPS instructions. A new basic block is started whenever we encounter a branch or direct jump whose target block has not yet been constructed. A new graph is started whenever the translator encounters a jump-and-link instruction (i.e., a call). Graph construction terminates at an indirect jump (i.e., a return), at a direct jump or branch whose targets have already been constructed, or at the end of the program. After all graphs are constructed, a graph-coloring algorithm allocates the MIPS registers to the 8 first-level registers with minimum register spills. The translator then walks the graphs, converts MIPS instructions to CRAY-2 instructions, and adjusts the targets of all control-flow instructions.
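The level 1 scheme above can be sketched as follows, assuming a toy three-address form. MIPS register i maps identically to second-level register Bi; every use goes through a first-level register (A0-A2 here, chosen arbitrarily). The `A`/`B` names and `<-` moves are illustrative placeholders, not real CRAY-2 mnemonics.

```python
# A minimal sketch of the naive (level 1) translation: each MIPS ALU
# instruction becomes spill-in moves, a compute step in first-level
# registers, and a write-back to the second level. The textual output
# format is a placeholder for real CRAY-2 encodings.

def translate_level1(mips_instr):
    """Expand one MIPS ALU instruction into a CRAY-2-style sequence."""
    op, rd, rs, rt = mips_instr            # e.g. ("add", 8, 9, 10)
    return [
        f"A0 <- B{rs}",                    # load operand: 2nd level -> 1st level
        f"A1 <- B{rt}",
        f"A2 <- A0 {op} A1",               # compute in first-level registers
        f"B{rd} <- A2",                    # write result back to 2nd level
    ]

seq = translate_level1(("add", 8, 9, 10))
print(seq)  # four CRAY-2-style instructions for one MIPS add
```

An instruction with an immediate operand (e.g. ADDI) would need one further step, loading the constant into a register first, since CRAY-2 arithmetic operations take only register operands. This 4x expansion per ALU instruction is exactly the inter-level move overhead that the level 2 allocator removes.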
Because we translate the binary code statically and assume that indirect jumps occur only at function returns, we do not need to handle them specially.

Level 3 translator. After transforming the MIPS instructions into graphs of basic blocks and applying the graph-coloring algorithm, superblocks [5] are generated from the flow graph of basic blocks and the edge profile information obtained by the simulator. We generate superblocks conservatively; that is, we do not duplicate a basic block when the downstream block is the target of more than one block. Beyond this, the criterion for enlarging a superblock is that the ratio of execution frequencies between its two paths exceeds 2.0. If both the conservative condition and the ratio requirement are met, we enlarge the superblock along the path with the higher execution frequency. An unconditional branch is removed and its target block merged directly when the conservative condition is met. After the superblocks are generated, the translator walks the graph of superblocks, now with a larger scope, and converts MIPS instructions to CRAY-2 instructions. Many code optimization techniques can then be applied to the CRAY-2 superblocks. As our translator is static, a couple of optimizations such as code sinking, peephole optimization, and dead code elimination are applied at this level. Figure 2 summarizes the relationships between the three translation levels: level 1 is the naive translator, level 2 adds graph-coloring register allocation over basic blocks, and level 3 adds superblock formation with code sinking, dead code elimination, and peephole optimization, all producing a binary CRAY-2 sequence.

Figure 2: Translator diagram

As compared in Section 3, each level gives different performance and tradeoffs.

2.4 Simulator

For a more precise comparison between the code before and after translation, we constructed two simple simulators, one for MIPS and one for CRAY-2. Each simulator supports only in-order execution with a single functional unit. The simulators can generate edge profile information along with other statistics such as the total number of instructions simulated, IPC, the total number of bytes of instructions
simulated, the total number of loads and stores, and the total number of inter- and intra-level register moves. These simulators also serve as correctness verification.

3. Experimental setup and results

Since we translate only a subset of the MIPS instruction set, we use a benchmark generated by our compiler. This benchmark contains approximately 1850 MIPS assembly instructions and performs several different tasks, e.g., a recursive Fibonacci algorithm, a dot product, a tight mathematical loop, and a nested loop. The benchmark runs for more than one million cycles before finishing, which is sufficient to evaluate our translator at each level. The register configurations used for both simulators are as follows.

                                                MIPS simulator   CRAY-2 simulator
  Number of first-level registers               32               8
  Number of second-level registers              0                64
  Latency of memory and of 1st- and
  2nd-level registers (cycles)                  1                1

Table 1: Register configuration for both simulators

We generate the MIPS binary of the benchmark, translate it with each translator level, and run the result on the simulator, comparing the standard output to verify correctness. The following figures summarize our experimental results.

Figure 4 shows the code size of the binary files. The code size increases after translation, for two reasons. First, even though most CRAY-2 instructions occupy only 2 bytes while every MIPS instruction occupies 4 bytes, the most frequently used MIPS instructions, such as ADDI, ORI, LUI, LW, and SW, require more than one CRAY-2 instruction to emulate. An important drawback of the CRAY-2 instruction set is that arithmetic and memory operations take only register operands and cannot operate on immediate values directly. To translate such instructions we therefore need at least one additional instruction to load the immediate value into a temporary register; this is a 4-byte instruction and thus increases the code size. Also, some MIPS instructions have no directly corresponding CRAY-2 instruction, so we need more than one instruction to achieve the same result. Second, the small number of first-level registers in the CRAY-2 architecture introduces overhead, because extra inter-level register move instructions are required, each consuming 4 bytes. Therefore the code and byte-code expansion are quite high.

Figure 4: Comparison of code size in bytes of the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

From Figure 4, the code generated by the naive (level 1) translator is about 4 times the source code size, mainly because of the huge number of data transfers between the two register levels. This expansion can be reduced sharply by graph-coloring register allocation, which exploits the first-level registers: in our benchmark the code size shrinks by a factor of about 2.2 at optimization level 2. Optimization level 3 gives slightly smaller code than level 2 due to the removal of some redundant code.

Figure 5 shows the simulation cycle results. Because simulation cycles are proportional to execution time at a fixed clock frequency, they are an ideal metric for performance. From Figure 5 we can see that
the performance of the code generated by the naive translator degrades by approximately 4x relative to the MIPS code, due to translation overhead. With optimization level 2, performance improves substantially due to the reduction in inter-level register move instructions, as shown in Figures 6 and 7.

Figure 5: Comparison of simulation cycles for the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

The optimizations applied at level 3 do not yield large improvements in code size or execution time, for two reasons. (1) The superblocks we form are too conservative, which limits their size and the scope for optimization; this is also why the code size does not increase after superblock formation. (2) Optimization schemes such as peephole optimization, dead code elimination, and code sinking are not powerful enough to reduce both size and execution time. Because peephole optimization has already been performed on the MIPS code, translation introduces few new opportunities for it. Code sinking does not reduce code size, though it can reduce execution time. We believe inlining would help performance considerably, but due to time limitations we did not implement or evaluate it.

Figures 6 and 7 show the experimental results for register reads and writes. Because MIPS has only one level of registers, we compare the accesses to the first-level registers in the CRAY-2 with the register accesses in MIPS. Figure 6 shows a trend for register reads and writes similar to that in Figures 4 and 5. The graph-coloring register allocation in the level 2 translation effectively reduces register reads and writes. However, as mentioned before, a fair number of CRAY-2 instructions cannot operate on immediate values while the corresponding MIPS instructions can, so some increase in register accesses is inevitable.

Figure 6: Comparison of the number of first-level register reads and writes for the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

Figure 7 shows the accesses to the second-level registers for the three translation levels. The drop in accesses after graph coloring is expected, since the algorithm effectively allocates the first-level registers.

Figure 8 shows the memory load and store results. The numbers change little, except at level 3, where some reductions come from eliminating redundant load and store instructions and from moving some loads and stores to off-trace blocks; the improvement is minor. Note that the number of memory operations does not increase after translation, because we assume the same bus bandwidth, so one CRAY-2 memory operation (load or store) achieves the same effect as one MIPS memory operation. On the other hand, because we assume memory loads and stores have only one-cycle latency, reducing them does not affect total execution time much.

Figure 7: Comparison of the number of second-level register reads and writes for the binaries generated by the level 1, level 2, and level 3 translators.
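The level 2 reduction in register traffic seen in Figures 6 and 7 comes from graph-coloring allocation onto the 8 first-level registers. A minimal greedy sketch (not the exact algorithm we used, whose details are omitted here; node names and edges are illustrative):

```python
# A hedged sketch of graph-coloring register allocation with k colors:
# virtual registers that are live at the same time interfere and must
# receive different first-level registers; uncolorable nodes are spilled
# (i.e., kept in second-level registers). A simple greedy ordering is
# used in place of the classic Chaitin simplify/select phases.

def color_registers(nodes, interferes, k=8):
    """Assign each virtual register a color in 0..k-1 or spill it."""
    color, spilled = {}, []
    # Color the most-constrained (highest-degree) nodes first.
    for n in sorted(nodes, key=lambda n: -len(interferes.get(n, ()))):
        taken = {color[m] for m in interferes.get(n, ()) if m in color}
        free = [c for c in range(k) if c not in taken]
        if free:
            color[n] = free[0]
        else:
            spilled.append(n)      # stays in a second-level register
    return color, spilled

# v0-v1 and v1-v2 interfere; v0 and v2 may share a register.
g = {"v0": {"v1"}, "v1": {"v0", "v2"}, "v2": {"v1"}}
cols, spills = color_registers(g.keys(), g, k=8)
print(cols, spills)
```

With 8 colors and short live ranges nothing spills here; the spill list grows only when more than 8 virtual registers are simultaneously live, which is exactly when the inter-level moves of Figure 7 reappear.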
Figure 8: Comparison of the number of memory loads and stores for the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

4. Related work

Over the last decades, several emulators have performed dynamic binary translation very successfully, including SHADE [3], DAISY [10], FX!32 [1], and SUN Wabi [2]. SHADE, a fast user-level instruction set simulator, runs on SPARC systems and simulates the SPARC (versions 8 and 9) and MIPS I instruction sets. It emulates a target system by dynamically cross-compiling the target machine code to run on the host. The dynamically compiled code can optionally include profiling code that traces the execution of the application running on the virtual target machine; the profiling is extensible and may be dynamically controlled by the user. Another interesting point is that SHADE has a translation cache and a translation TLB, so translations are cached for later reuse to amortize compilation cost. DAISY, from the IBM Watson Research Center, targets full-system simulation; unlike SHADE, it runs at both user and OS level. At run time, DAISY dynamically translates code for a PowerPC processor into code for an underlying VLIW processor. Its translated VLIW code is also saved and cached, so that when the same PowerPC code is encountered later, its VLIW translation can execute immediately without retranslation. FX!32 and SUN Wabi are both ABI VMs whose host ISAs differ from the Intel x86 ISA. FX!32 translates any 32-bit x86 application that runs under Windows NT 4.0 on an Intel x86 microprocessor to run on an Alpha microprocessor also running Windows NT 4.0, while SUN Wabi translates some commonly used x86 applications to run on Unix. The translation in SUN Wabi is thus more involved, which is reflected in its primary goal: it supports only some commonly used Windows applications and emphasizes correctness. FX!32 does no translation while the application is executing; rather, it captures an execution profile during the first run and stores it on disk. Translation is done offline, using the run-time profile data to translate and optimize, and the result is placed in a translation cache. Later executions use the translated code in the cache and continue profiling. SUN Wabi uses dynamic translation, exploiting its advantages: adaptive optimization can exceed static quality, and application code can be generated dynamically. It retranslates each time the ABI is initiated, so there is no persistence between executions. SUN Wabi translates per interval and stores the result in a code cache with FIFO management.

5. Conclusions and future work

Several conclusions can be drawn from this experiment. First, a large number of first-level registers is desirable. In our experiment, the small number of first-level registers becomes the bottleneck, and we do not fully utilize the second-level registers. Because we implement neither register renaming nor speculative execution, we need only as many second-level registers as MIPS has registers; however, the small number of first-level registers introduces extra work such as spills. Even if we implemented register renaming and speculative execution, we do not expect much improvement, because the increase in inter-level register moves would offset the benefit of renaming and speculation. Moreover, in our experiment there is only one execution unit, so instructions cannot execute in parallel.
This structure is not well suited to hierarchical register files. In modern hierarchical register designs there is more than one functional unit, and each functional unit has its own first-level register file. Even in such multi-functional-unit designs, however, the translator needs an effective way to distribute register usage evenly; otherwise the small number of first-level registers will still be the bottleneck.
The graph-coloring register allocation algorithm can reduce the inter-level register move overhead substantially, but it is time-consuming and impractical for dynamic translation. Second, the conservative superblock formation does not bring much benefit: although it limits size expansion, it prevents us from discovering more optimization opportunities. Third, the fact that a fair number of CRAY-2 instructions cannot operate on immediate values directly brings much overhead; it increases code size and register accesses, and consequently increases execution time.

For future work, integrating the simulator with the translator and constructing a translation cache would allow dynamic optimization to be performed and could improve the quality of the translator beyond what we see in this project. Since a hierarchical register file may help the performance of VLIW architectures, as introduced in [4], it would also be interesting to study the effect of organizing CRAY-2 instructions in bundles and running them in VLIW fashion; this would require extending the simulator into a VLIW simulator.
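The conservative superblock-enlargement rule discussed above (Section 2.3) can be sketched as follows; the block names, successor/predecessor maps, and edge frequencies are illustrative.

```python
# A minimal sketch of conservative superblock growth: extend the trace
# along the hotter successor only when that successor has a unique
# predecessor (no tail duplication of join blocks) and the hot path is
# at least RATIO times more frequent than the alternative.

RATIO = 2.0

def grow_superblock(entry, succs, preds, freq):
    """Return the list of basic blocks merged into one superblock."""
    trace, cur = [entry], entry
    while True:
        out = succs.get(cur, [])
        if not out:
            break
        # Pick the most frequently taken outgoing edge.
        out = sorted(out, key=lambda b: -freq[(cur, b)])
        hot = out[0]
        alt_f = freq[(cur, out[1])] if len(out) > 1 else 0.0
        # Conservative condition: never duplicate a join block or loop back.
        if hot in trace or len(preds.get(hot, [])) != 1:
            break
        if alt_f and freq[(cur, hot)] / alt_f < RATIO:
            break
        trace.append(hot)
        cur = hot
    return trace

succs = {"B0": ["B1", "B2"], "B1": ["B3"]}
preds = {"B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
freq = {("B0", "B1"): 90, ("B0", "B2"): 10, ("B1", "B3"): 90}
print(grow_superblock("B0", succs, preds, freq))  # ['B0', 'B1']
```

Here growth stops at B3 because it is a join block with two predecessors; a less conservative formation would tail-duplicate B3 into the trace, which is exactly the size/opportunity tradeoff noted in the conclusion.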
More informationEITF20: Computer Architecture Part 5.1.1: Virtual Memory
EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationLIA. Large Installation Administration. Virtualization
LIA Large Installation Administration Virtualization 2 Virtualization What is Virtualization "a technique for hiding the physical characteristics of computing resources from the way in which other systems,
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationInstruction Level Parallelism
Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationCrusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor
Crusoe Reference Thinking Outside the Box The Transmeta Crusoe Processor 55:132/22C:160 High Performance Computer Architecture The Technology Behind Crusoe Processors--Low-power -Compatible Processors
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationVirtual Memory: From Address Translation to Demand Paging
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014
More informationCS3350B Computer Architecture MIPS Introduction
CS3350B Computer Architecture MIPS Introduction Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada Thursday January
More informationA Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines
A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode
More informationChapter 7 The Potential of Special-Purpose Hardware
Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationSlides for Lecture 6
Slides for Lecture 6 ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary 28 January,
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationUNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.
UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationCENG3420 Lecture 03 Review
CENG3420 Lecture 03 Review Bei Yu byu@cse.cuhk.edu.hk 2017 Spring 1 / 38 CISC vs. RISC Complex Instruction Set Computer (CISC) Lots of instructions of variable size, very memory optimal, typically less
More informationInstructions: Language of the Computer
CS359: Computer Architecture Instructions: Language of the Computer Yanyan Shen Department of Computer Science and Engineering 1 The Language a Computer Understands Word a computer understands: instruction
More informationComputer Architecture. Chapter 2-2. Instructions: Language of the Computer
Computer Architecture Chapter 2-2 Instructions: Language of the Computer 1 Procedures A major program structuring mechanism Calling & returning from a procedure requires a protocol. The protocol is a sequence
More informationBuilding a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano
Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code
More informationUniprocessors. HPC Fall 2012 Prof. Robert van Engelen
Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures
More informationBus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao
Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao Abstract In microprocessor-based systems, data and address buses are the core of the interface between a microprocessor
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationLec 13: Linking and Memory. Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University. Announcements
Lec 13: Linking and Memory Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University PA 2 is out Due on Oct 22 nd Announcements Prelim Oct 23 rd, 7:30-9:30/10:00 All content up to Lecture on Oct
More informationECE232: Hardware Organization and Design
ECE232: Hardware Organization and Design Lecture 4: MIPS Instructions Adapted from Computer Organization and Design, Patterson & Hennessy, UCB From Last Time Two values enter from the left (A and B) Need
More informationThe Implications of Multi-core
The Implications of Multi- What I want to do today Given that everyone is heralding Multi- Is it really the Holy Grail? Will it cure cancer? A lot of misinformation has surfaced What multi- is and what
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More information4. Hardware Platform: Real-Time Requirements
4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture
More informationCS 252 Graduate Computer Architecture. Lecture 15: Virtual Machines
CS 252 Graduate Computer Architecture Lecture 15: Virtual Machines Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252
More informationTen Reasons to Optimize a Processor
By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor
More informationParallelism of Java Bytecode Programs and a Java ILP Processor Architecture
Australian Computer Science Communications, Vol.21, No.4, 1999, Springer-Verlag Singapore Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture Kenji Watanabe and Yamin Li Graduate
More informationInstruction Set Principles. (Appendix B)
Instruction Set Principles (Appendix B) Outline Introduction Classification of Instruction Set Architectures Addressing Modes Instruction Set Operations Type & Size of Operands Instruction Set Encoding
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More information4.1 Paging suffers from and Segmentation suffers from. Ans
Worked out Examples 4.1 Paging suffers from and Segmentation suffers from. Ans: Internal Fragmentation, External Fragmentation 4.2 Which of the following is/are fastest memory allocation policy? a. First
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationComputer Organization MIPS ISA
CPE 335 Computer Organization MIPS ISA Dr. Iyad Jafar Adapted from Dr. Gheith Abandah Slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE 232 MIPS ISA 1 (vonneumann) Processor Organization
More informationCHAPTER 5 A Closer Look at Instruction Set Architectures
CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 199 5.2 Instruction Formats 199 5.2.1 Design Decisions for Instruction Sets 200 5.2.2 Little versus Big Endian 201 5.2.3 Internal
More informationLecture 4: MIPS Instruction Set
Lecture 4: MIPS Instruction Set No class on Tuesday Today s topic: MIPS instructions Code examples 1 Instruction Set Understanding the language of the hardware is key to understanding the hardware/software
More informationComputer Systems Architecture I. CSE 560M Lecture 3 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 3 Prof. Patrick Crowley Plan for Today Announcements Readings are extremely important! No class meeting next Monday Questions Commentaries A few remaining
More informationCPS104 Computer Organization Lecture 1. CPS104: Computer Organization. Meat of the Course. Robert Wagner
CPS104 Computer Organization Lecture 1 Robert Wagner Slides available on: http://www.cs.duke.edu/~raw/cps104/lectures 1 CPS104: Computer Organization Instructor: Robert Wagner Office: LSRC D336, 660-6536
More informationCSCE 5610: Computer Architecture
HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationComputer Architecture Review. Jo, Heeseung
Computer Architecture Review Jo, Heeseung Computer Abstractions and Technology Jo, Heeseung Below Your Program Application software Written in high-level language System software Compiler: translates HLL
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationA Study of Workstation Computational Performance for Real-Time Flight Simulation
A Study of Workstation Computational Performance for Real-Time Flight Simulation Summary Jeffrey M. Maddalon Jeff I. Cleveland II This paper presents the results of a computational benchmark, based on
More informationCS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 15: Caching: Demand Paged Virtual Memory
CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 15: Caching: Demand Paged Virtual Memory 15.0 Main Points: Concept of paging to disk Replacement policies
More informationMicroprocessor Architecture Dr. Charles Kim Howard University
EECE416 Microcomputer Fundamentals Microprocessor Architecture Dr. Charles Kim Howard University 1 Computer Architecture Computer System CPU (with PC, Register, SR) + Memory 2 Computer Architecture ALU
More informationCPS104 Computer Organization Lecture 1
CPS104 Computer Organization Lecture 1 Robert Wagner Slides available on: http://www.cs.duke.edu/~raw/cps104/lectures 1 CPS104: Computer Organization Instructor: Robert Wagner Office: LSRC D336, 660-6536
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More information