Advances and Future Challenges in Binary Translation and Optimization

1 Advances and Future Challenges in Binary Translation and Optimization
Erik R. Altman, Kemal Ebcioğlu, Michael Gschwind (Senior Member, IEEE), and Sumedh Sathaye
Presented by Holly Ferguson

2 Introduction: Can you define these terms?
Dynamic binary translation = ?
Dynamic optimization = ?
[Figure: a regular hardware architecture solution versus various current architectural solutions]
Keywords: Binary Translation, Compilers, Dynamic Optimization, ILP, Java, JIT, VLIW

4 Introduction and Purpose
Dynamic binary translation is just-in-time (JIT) compilation from the binary code of one architecture to another.
Dynamic optimization is run-time improvement of code.
[Figure: Running Program -> Input Code Stream -> Translate and Optimize -> Output Code, in place of a regular hardware architecture solution]
Keywords: Binary Translation, Compilers, Dynamic Optimization, ILP, Java, JIT, VLIW

5 Purpose
Dynamic binary translation is just-in-time (JIT) compilation from the binary code of one architecture to another.
[Figure: Running Program -> Input Code Stream -> Translate and Optimize, leading to a variety of hardware architecture solutions instead of one regular hardware architecture solution]
Keywords: Binary Translation, Compilers, Dynamic Optimization, ILP, Java, JIT, VLIW

6 Purpose: What binary translation (BT) addresses
Allows the architecture to become a layer of software.
As software, it fixes the problems of running legacy software directly.
Enables optimizations outside existing hardware boundaries.
Attracts both commercial and research interest.
BT is done automatically at run time, without programmer intervention.
Saves power, since memory uses less power than logic (for non-superscalar designs).
The paper analyzes these questions through projects including DAISY, Crusoe, Dynamo, and LaTTe.
Next question: Why is BT a negative?

7 Negatives: Why else is BT a negative?
BT and the VMM have additional drawbacks:
For IT applications, commonality, or virtual IT shops, BT may mean disruptive behavior.
VMM debugging is difficult: the target is several times removed from the source, and behavior is nondeterministic.
The VMM takes memory and resources from the source architecture machine.
The VMM takes cycles from source architecture programs.
This was new design territory (2001), so calibration is difficult.
The start of program execution is slow, since all code is first interpreted and then translated to target architecture code.
Overtaking hardware performance is a large concern as well.
Terms: Virtual Machine Monitor (VMM); Code Morphing Software (CMS), the Crusoe VMM; Translation Cache (TCache).
[Figure: Running Program -> Input Code Stream -> Translate and Optimize -> Output Code, in place of a regular hardware architecture solution]

8 Negatives: Why else is BT a negative?
Difficult issues in managing BT, a large focus for DAISY, Crusoe, Dynamo, and LaTTe:
self-modifying code; precise exceptions; address translation; self-referential code; management of translations; real-time behavior; boot and BIOS code; reliability and correctness; code reuse.
The systems are talked about in terms of (IEEE, Nov. 2001):
interpreting versus translating/optimizing;
emulating a virtual versus a real machine;
full system versus user mode only;
OS independent versus OS dependent;
translating to a different versus the same architecture;
emulating a single versus multiple source architectures (referred to as source (legacy) to target architecture).
Terms: Virtual Machine Monitor (VMM); Code Morphing Software (CMS), the Crusoe VMM; Translation Cache (TCache).

9 Positives: Why is BT a positive?
Attractions to the study of binary translation:
The pursuit of architecture-independent computing, using the run-anywhere object code idea.
A farm with a variety of customers means a variety of architectures.
The paper analyzes these questions through projects including DAISY, Crusoe, Dynamo, and LaTTe.

12 Positives: Attractions to the study of binary translation
The pursuit of architecture-independent computing, using the run-anywhere object code idea.
A farm with a variety of customers means a variety of architectures. That gives a static total of machines and determines a static breakdown by architecture, which means limited utilization and thus increased cost.
BT is a solution for better utilization: as a layer of software, it allows dynamic configuration of the many machines in a farm.
The paper analyzes these questions through projects including DAISY, Crusoe, Dynamo, and LaTTe.

14 Positives: Why else is BT a positive?
BT permits optimizations under the covers while the user runs the program.
BT is not limited by, and can cross, boundaries such as indirect calls, function returns, shared libraries, and system calls.
With BT, the intelligence is in software, not hardware. This means smaller chips with higher yield.
Installing a better algorithm requires only a software patch that updates the VMM.
A software patch is sufficient to fix a bug in the VMM.
Bugs such as nonworking opcodes may be worked around by changing the VMM software.
Legacy binaries for which source code is unavailable can be optimized.
Translated basic blocks can be laid out contiguously in a natural order, which improves instruction cache performance.
Future architectural improvements are transparent to the user.
VLIWs of different sizes and generations stay compatible.
[Figure: Running Program -> Input Code Stream -> Translate and Optimize -> Output Code, in place of a regular hardware architecture solution]

15 CVM: What is the best solution?
With software convergence, BT and JIT optimizations admit the possibility of a convergence virtual machine (CVM).
JVM (Java Virtual Machine):
Write-once, run-anywhere model.
Existing C/C++ applications and OSs, Linux included, do not run on the JVM.
For security and safety guarantees, it gives up the universal ability to handle other hardware problems.
CVM:
Similar to the JVM (same goal), with different tradeoffs.
Research works to allow the same OS and application object code to run on different platforms with the CVM.
It is universal through JIT compilation and virtual device emulation (= less protection than a modern RISC processor).
"The Internet has recently been changing the software landscape [radically] and has been implicitly encouraging write-once, run-anywhere software and interoperability of different hardware platforms, as exemplified by the recent popularity of technologies such as XML, Simple Object Access Protocol (SOAP) [17], and Java." (Altman)

17 CVM: What are future implications?
[Figure: two software stacks, listed top to bottom]
Linux and apps under CVM: Linux apps compiled for CVM; Linux compiled for CVM; CVM for PowerPC or x86; PowerPC or x86 hardware.
Linux and apps under PowerPC/x86: Linux apps compiled for PowerPC or x86; Linux compiled for PowerPC (or a different build for x86); PowerPC or x86 hardware.
Using the CVM and object code in XML format, an OS can be booted from the web.

18 DAISY, Crusoe, Dynamo, LaTTe: DAISY

19 IBM DAISY (PowerPC)
[Figure: DAISY system: DAISY VLIW processor, L3 cache, 6xx bus, memory controller, memory, PCI bus, DAISY flash ROM, PowerPC flash ROM, disk, video, network, keyboard]
Most DAISY work uses PowerPC as the source architecture.
By allowing operations from multiple paths in its groups, DAISY is not dependent on good branch prediction, and indeed makes forward progress along all possible paths.

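20 IBM DAISY (PowerPC)
[Figure: DAISY translation flowchart, reconstructed from the labels that survive]
Interpret an instruction and add it to group X.
Stopping point? If no, keep interpreting. If yes, set Y = X and start a new group X.
Previously translated entry point? If yes, execute the group's VLIW translation.
If no: has the group been interpreted 30 times? If no, keep interpreting. If yes, translate and schedule group Y into VLIW instructions.
Stopping-point tests shown on the slide: frequently executed code; > 24 ops; ILP > 3; ILP > 10; > 180 ops; at a good stop point.
Good stop point: for example, a loopback.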
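
The interpret-then-translate flow on this slide can be sketched in a few lines of Python. This is a toy illustration, not DAISY's implementation: the class `ToyVMM`, the two-operation guest "instruction set", and the affine folding that stands in for VLIW scheduling are all assumptions made for the example. Only the threshold of 30 interpretations comes from the slide.

```python
from collections import defaultdict

HOT_THRESHOLD = 30  # from the slide: translate after a group is interpreted 30 times


class ToyVMM:
    """Toy virtual machine monitor: interpret cold groups, translate hot ones."""

    def __init__(self, groups):
        # groups maps an entry point to a list of (op, constant) pairs.
        # The ops "add" and "mul" are a hypothetical guest instruction set.
        self.groups = groups
        self.interp_counts = defaultdict(int)
        self.tcache = {}  # translation cache: entry point -> Python callable
        self.stats = {"interpreted": 0, "from_tcache": 0}

    def interpret(self, entry, acc):
        """Execute one group an operation at a time (slow path)."""
        for op, value in self.groups[entry]:
            if op == "add":
                acc += value
            elif op == "mul":
                acc *= value
            else:
                raise ValueError(f"unknown op: {op}")
        return acc

    def translate(self, entry):
        """Fold the group into acc * scale + offset (stand-in for scheduling)."""
        scale, offset = 1, 0
        for op, value in self.groups[entry]:
            if op == "add":
                offset += value
            elif op == "mul":
                scale *= value
                offset *= value
            else:
                raise ValueError(f"unknown op: {op}")
        return lambda acc: acc * scale + offset

    def run(self, entry, acc):
        """Run one group from the translation cache if present, else interpret."""
        if entry in self.tcache:
            self.stats["from_tcache"] += 1
            return self.tcache[entry](acc)
        self.stats["interpreted"] += 1
        self.interp_counts[entry] += 1
        result = self.interpret(entry, acc)
        if self.interp_counts[entry] >= HOT_THRESHOLD:
            self.tcache[entry] = self.translate(entry)
        return result


vmm = ToyVMM({0: [("add", 2), ("mul", 3)]})
results = [vmm.run(0, 1) for _ in range(40)]
```

In this run the group is interpreted for its first 30 executions and served from the translation cache afterwards, which is the cost pattern behind the slow-start drawback listed under Negatives.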

21 IBM DAISY (PowerPC)
DAISY optimizations:
Renaming, using scratchpad registers r36 to r63.
Load-store telescoping (allows dependence chains to be significantly shortened).

22 Transmeta Crusoe (x86)
[Figure: software stack, top to bottom: x86 applications; Windows, Linux, BIOS; Code Morphing SW; Crusoe hardware]
When a translated group completes, the contents of all working x86 registers are copied at once to the shadow register set. This shadow copy allows Crusoe to efficiently recover from exceptions.
[Figure: Crusoe working registers: 64 registers, with working and shadow copies of the x86 architectural state]
Performance: DAISY = 3 to 4 PowerPC instructions per cycle. Transmeta claims the 667-MHz Crusoe TM5400 = 500-MHz Pentium III.
Different intended use, DAISY versus Crusoe:
Type: big machine versus small/mobile.
Total memory: gigabytes versus megabytes.
Memory for itself and its translations: 100 MB+ versus 16 MB.
[Figure: a 128-bit molecule holds up to 4 atoms (ops): FADD to the floating-point unit, ADD to the integer ALU, LD to the load/store unit, BRCC to the branch unit; in-order pipeline]
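
The working/shadow register scheme described on this slide can be illustrated with a small Python sketch. It is a software analogy only, under stated assumptions: `ShadowedRegisters`, `run_group`, and the two-register state are hypothetical names for the example, and a Python `ArithmeticError` stands in for a hardware exception. Crusoe does the copy in hardware, all registers at once.

```python
class ShadowedRegisters:
    """Working registers plus a shadow copy holding the last committed state."""

    def __init__(self, names):
        self.working = {name: 0 for name in names}
        self.shadow = dict(self.working)

    def commit(self):
        # A translated group completed: copy all working registers to the shadow set.
        self.shadow = dict(self.working)

    def rollback(self):
        # An exception occurred: restore the last committed state.
        self.working = dict(self.shadow)


def run_group(regs, operations):
    """Run a translated group; commit on success, roll back on an exception."""
    try:
        for operation in operations:
            operation(regs.working)
    except ArithmeticError:
        regs.rollback()
        return False
    regs.commit()
    return True


def set_eax(value):
    """Build an operation that writes a constant into eax."""
    def operation(working):
        working["eax"] = value
    return operation


def divide_eax_by_ebx(working):
    """Integer divide; raises ZeroDivisionError when ebx is 0."""
    working["eax"] = working["eax"] // working["ebx"]


regs = ShadowedRegisters(["eax", "ebx"])
first_ok = run_group(regs, [set_eax(5)])
second_ok = run_group(regs, [set_eax(99), divide_eax_by_ebx])
```

The second group faults after it has already overwritten `eax`, and the rollback restores the value committed by the first group, so the partially executed group leaves no visible trace.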

23 Transmeta Crusoe (x86)
8 KB of on-chip local memory for data and 8 KB for instructions, so CMS can stay resident for quick access without disturbing x86 code and data.
16-way associativity of the L1 DCache minimizes conflicts between x86 data and the data used by CMS.
The memory controller is integrated into the TM5400 because it is part of the standard PC architecture.
Unlike DAISY, Crusoe has optimizations like strength reduction and aggressive dead code elimination.
The TM5400 has fewer transistors (7 million) than AMD and Intel microprocessors, showing that BT allows a reduction in hardware complexity.
[Figure: hardware aliasing support on the TM5400 Crusoe chip, alias buffer (AB): ldp (X), a speculative load, records its address and size; stam (Y), a store under alias mask, is checked against those entries. No alias: continue execution. Alias: raise an exception to the Crusoe VMM.]
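
The alias check in the figure can be sketched in software as follows. This is a hedged illustration, not Transmeta's hardware: the class `AliasHardware`, the byte-range overlap test, and the unbounded list of protected ranges are assumptions for the example. Only the `ldp`/`stam` roles and the two outcomes (continue, or raise an exception to the VMM) come from the slide.

```python
class AliasException(Exception):
    """Raised to the VMM when a store overlaps a speculatively loaded range."""


class AliasHardware:
    """Software model of the load-and-protect / store-under-alias-mask check."""

    def __init__(self):
        self.protected = []  # (address, size) recorded by speculative loads

    def ldp(self, memory, address, size):
        """Speculative load: read the bytes and record the protected range."""
        self.protected.append((address, size))
        return bytes(memory[address:address + size])

    def stam(self, memory, address, data):
        """Store under alias mask: fault if the store overlaps a protected range."""
        end = address + len(data)
        for p_address, p_size in self.protected:
            if address < p_address + p_size and p_address < end:
                raise AliasException(f"store at {address} aliases load at {p_address}")
        memory[address:end] = data

    def clear(self):
        """Drop all protected ranges, for example when a group commits."""
        self.protected.clear()


memory = bytearray(16)
hw = AliasHardware()
loaded = hw.ldp(memory, 4, 4)      # protects bytes 4..7
hw.stam(memory, 8, b"\x01\x02")    # bytes 8..9: no alias, store proceeds
try:
    hw.stam(memory, 6, b"\xff")    # byte 6: overlaps the protected range
    aliased = False
except AliasException:
    aliased = True
```

The point of the check is that the translator may hoist a load above a store it cannot prove independent, then rely on the exception path to recover in the rare case where the two actually overlap.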

25 HP Dynamo
[Figure: software stack, top to bottom: HPUX applications; Dynamo software; HPUX; PA-RISC]
Dynamo's translations last for only one program invocation. DAISY and Crusoe translations last over many invocations of a program.
Dynamo runs in the virtual address space of a single HPUX process.
Problematic: this exposes Dynamo to the application, so it is less transparent.
Potential plus: there is no need to deal with address translation or with groups spanning multiple pages. This reduces the number of synchronous exceptions that must be handled correctly and allows more aggressive code optimizations.

26 HP Dynamo
[Figure: software stack, top to bottom: HPUX applications; Dynamo software; HPUX; PA-RISC]
Some of Dynamo's optimizations: copy propagation, constant propagation, strength reduction, loop-invariant code motion, loop unrolling.
DAISY versus Dynamo:
DAISY translates groups as paths or trees.
Dynamo's groups are paths, never trees or other forms. This limits code explosion at the cost of losses in exploiting parallelism.
If the overhead costs more than the optimization is helping, Dynamo bails out and executes the original code directly, without translation.
If the overhead costs less than the optimization is helping, Dynamo continues with the VMM.
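
Because a Dynamo group is a single path, optimizations such as constant propagation reduce to one forward pass over straight-line code. The sketch below illustrates that idea only. The tuple format `(dest, op, left, right)`, the register names, and the function `propagate_constants` are hypothetical and are not Dynamo's internal representation.

```python
import operator

# Operations the sketch knows how to fold at translation time.
FOLDABLE = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def propagate_constants(path):
    """Constant propagation and folding over one straight-line path.

    Each instruction is (dest, op, left, right). Operands are ints (constants)
    or strings (register names). A "const" op loads the int in `left`.
    """
    known = {}       # register name -> constant value known at this point
    optimized = []
    for dest, op, left, right in path:
        # Replace register operands whose value is already known.
        if isinstance(left, str):
            left = known.get(left, left)
        if isinstance(right, str):
            right = known.get(right, right)

        if op == "const":
            known[dest] = left
            optimized.append((dest, "const", left, None))
        elif op in FOLDABLE and isinstance(left, int) and isinstance(right, int):
            value = FOLDABLE[op](left, right)
            known[dest] = value
            optimized.append((dest, "const", value, None))
        else:
            # Result is not a compile-time constant: forget any old value of dest.
            known.pop(dest, None)
            optimized.append((dest, op, left, right))
    return optimized


path = [
    ("r1", "add", 2, 3),
    ("r2", "mul", "r1", 4),
    ("r3", "add", "r2", "r9"),
]
optimized_path = propagate_constants(path)
```

No merging of facts from different predecessors is needed on a path, which is one reason path-shaped groups keep the optimizer cheap. Tree or general CFG groups would need per-branch state.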

27 HP Dynamo
[Figure: software stack, top to bottom: HPUX applications; Dynamo software; HPUX; PA-RISC]

28 IBM LaTTe (SPARC)
Java JIT compilers, such as LaTTe, use dynamic translation and optimization to move from Java Virtual Machine code to RISC code. [3]
[Figure: Bytecode -> CFG of pseudo SPARC code -> CFG of real SPARC code -> native SPARC code]
Bytecode translation: the Java stack is mapped to symbolic registers.
Register allocation and optimizations: symbolic registers are allocated to machine registers.
Code emission: a binary image is generated from the CFG, which determines the locations of basic blocks.

29 IBM LaTTe (SPARC)
LaTTe and Java performance advantages over BT for traditional architectures:
A lightweight monitor optimized for single-threaded programs.
Efficient exception handling: instead of inserting code, it uses hardware-generated signals to detect exceptions such as out-of-bounds memory accesses.
Efficient garbage collection, memory management, and sophisticated JIT compilation techniques.
It converts virtual method calls to direct method calls, or inlines them by including a specific (conditional branch) check for the most frequently occurring method invoked from a particular call site.
Register allocation converts the JVM's stack-based model, with push and pop operations, to the register-based model used in RISC machines such as SPARC.
[Figure: Bytecode -> CFG of pseudo SPARC code -> CFG of real SPARC code -> native SPARC code]
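
The last point, moving from a push/pop stack model to a register model, can be sketched by simulating the operand stack at translation time. The example is a simplified illustration and not LaTTe's code: it handles only four JVM integer instructions (`iload`, `iconst`, `iadd`, `imul`, `istore`, with `iconst` taking its value as an explicit operand for brevity), and the symbolic register names `s0`, `s1`, ... and the output tuple format are assumptions.

```python
from itertools import count


def stack_to_registers(bytecode):
    """Translate stack-based bytecode into three-address register code.

    The operand stack is simulated at translation time: it holds the *names*
    of the values that would be on the stack, so pushes and pops disappear
    from the output.
    """
    stack = []                 # names of values currently on the operand stack
    code = []                  # emitted (dest, op, left, right) tuples
    fresh = count()            # numbering for symbolic registers s0, s1, ...

    for instruction in bytecode:
        op = instruction[0]
        if op == "iload":      # push a local variable
            stack.append(f"local{instruction[1]}")
        elif op == "iconst":   # push an integer constant
            stack.append(str(instruction[1]))
        elif op in ("iadd", "imul"):
            right = stack.pop()
            left = stack.pop()
            dest = f"s{next(fresh)}"
            code.append((dest, op[1:], left, right))
            stack.append(dest)
        elif op == "istore":   # pop into a local variable
            code.append((f"local{instruction[1]}", "mov", stack.pop(), None))
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return code


# Bytecode for: local2 = (local0 + local1) * 2
bytecode = [
    ("iload", 0),
    ("iload", 1),
    ("iadd",),
    ("iconst", 2),
    ("imul",),
    ("istore", 2),
]
register_code = stack_to_registers(bytecode)
```

Six stack instructions become three register instructions. A later allocation pass would map the symbolic registers `s0` and `s1` to machine registers.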

30 IBM LaTTe (SPARC)
LaTTe and Java performance advantages over BT for traditional architectures:
(Difference: LaTTe uses BT to improve VM performance and, in a way, to mitigate the gap between the source architecture, the JVM, and the underlying target architecture.)
[Figure: control flow graphs over blocks A, B, C, D: (a) original control flow graph; (b) CFG of DAISY-transformed code; (c) CFG of LaTTe-transformed code]
Tree regions are LaTTe's optimization unit.
LaTTe seeks tree groups in the input code, instead of the output, with two-pass register allocation: first a backward sweep to convey upward the information on all symbolic registers live at the group exits, then a forward sweep for the actual register allocation, based on hints from the backward sweep.
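
The backward sweep can be illustrated with a standard liveness pass over a single straight-line block. This is a simplified sketch under stated assumptions, not LaTTe's allocator: LaTTe works on tree regions with several exits, while the example handles one block with one exit, and the instruction format `(dest, uses)` and the name `backward_liveness` are hypothetical.

```python
def backward_liveness(block, live_at_exit):
    """Compute which registers are live before each instruction.

    block is a list of (dest, uses) pairs, where `uses` is a tuple of the
    registers the instruction reads. The sweep runs from the block exit
    upward, so information about the exit reaches every earlier instruction.
    """
    live = set(live_at_exit)
    live_before = [None] * len(block)
    for index in range(len(block) - 1, -1, -1):
        dest, uses = block[index]
        live.discard(dest)   # the old value of dest is dead above its definition
        live.update(uses)    # operands must be live on entry to the instruction
        live_before[index] = frozenset(live)
    return live_before


block = [
    ("s0", ("a", "b")),     # s0 = a + b
    ("s1", ("s0", "c")),    # s1 = s0 * c
    ("d", ("s1",)),         # d  = s1
]
liveness = backward_liveness(block, live_at_exit={"d"})
```

A forward allocation pass can use these sets as hints: a symbolic register that is not live after an instruction can release its machine register there.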

31 Afterthought: Additional questions asked
1. Can a binary translation machine have generally better performance than a well-designed superscalar?
2. Can all real-time problems be avoided?
3. What memory management schemes are best over a wide range of TCache sizes?
4. In full-system BT, how can the amount of VMM memory change after startup?
5. If the VMM gives memory back to the system and later needs more because of an unusually large working set for the TCache, it has no way to transparently steal the memory from the OS running above it.
6. Should the target architecture ever be exposed for users to access directly, bypassing the source architecture layer and translation by the VMM?
Important considerations: operation semantics; data formats (condition codes, FP, etc.); address translation; special-purpose registers; more registers than the source architecture (for the VMM scratchpad).

32 Recent Work: FPGA/Warp Processing (2008)
Using desktop, server, and scientific-computing applications, results gave similar speedups. For example, compared to a four-processor 400-MHz ARM11 system, warp processing obtained average speedups of 169X.

33 Recent Work: MTCrossBit (2011, from CrossBit)
A dynamic binary translation (DBT) system based on multithreaded optimization.
Existing DBT techniques target a single-threaded execution environment (and increase the complexity of the hardware or the runtime overhead).
MTCrossBit is a multithreaded DBT framework with no associated hardware. It uses a helper thread to build a hot trace, which reduces overhead.
The main and helper threads run on different cores to use multi-core resources efficiently.
Two methods: 1. the dual-special-parallel translation caches; 2. a new lock-free thread communication mechanism, assembly language communication (ASLC).
Supported guest platforms include SimpleScalar, IA32, MIPS, and SPARC. Host platforms include IA32 (fully supported), PowerPC, etc.

34 Recent Work: MTCrossBit (2011, framework)
MTCrossBit builds a hot trace that is concurrently executed in the helper thread to boost performance, as opposed to sequential execution in a single-threaded execution environment.
However, a multithreaded framework has unavoidable problems such as mutual exclusion, access to translated basic blocks, and communication between threads.

35 Recent Work: CPU framework into GPUs
Overhead has always been an issue with DBT.
A GPU (a multi-core processor used as a co-processor) can execute the hot spots of binary code in parallel, which can reduce the overhead of DBT.
One solution is to construct a virtual execution environment that accelerates DBT on CPU/GPU-based architectures.
Given the hot spots of the binary code and their related information, the framework converts the sequential code into PTX form and executes it on GPUs.
There is no need to rewrite the source code, and binary compatibility issues between different GPUs are also resolved. This usually gives a 10X speedup compared to x86 native platforms, and more with larger input.

36 Recent Work: CPU framework into GPUs
[Figure: workflow of GXBit, first and second execution phases (extracts hot spots and converts them to GPU form)]
[Figure: the translation framework]

37 Bibliography: Sources consulted
[1] Altman et al., Advances and Future Challenges in Binary Translation and Optimization, Proceedings of the IEEE, vol. 89, no. 11, November 2001.
[2] K. Diefendorff, Power4 focuses on memory bandwidth, Microprocessor Rep., vol. 13, October
[3] Dynamic Binary Translation and Optimization, December 2000: < micro33/tutorial/tutorial.html>, <
[4] Vahid, F.; Stitt, G.; Lysecky, R. et al., Warp Processing: Dynamic Translation of Binaries to FPGA Circuits, Computer, vol. 41, issue 7,
[5] Guan HaiBing et al., MTCrossBit: A dynamic binary translation system based on multithreaded optimization, Science China, vol. 54, no. 10, October
[6] Erzhou Zhu, Haibing Guan, Guoxing Dong, Yindong Yang, Hongbo Yang, A Translation Framework for Executing the Sequential Binary Code on CPU/GPU Based Architectures, Journal of Software, vol. 6, no. 12, December

38 Questions/Comments?
Outline: Introduction; Purpose; Negatives; Positives; CVM; DAISY; Crusoe; Dynamo; LaTTe; Afterthought; Recent Work; Bibliography
