Dynamic Binary Translation for Generation of Cycle Accurate Architecture Simulators Institut für Computersprachen Technische Universtät Wien Austria Andreas Fellnhofer Andreas Krall David Riegler Part of this work was supported by OnDemand Microelectronics and the Christian Doppler Forschungsgesellschaft Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 1 / 1
Overview Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 2 / 1
Previous Projects Previous Projects incremental compilers static assembly language instrumentation Molecule (ATOM clone) STonX CACAO bintrans reverse compilation (DSP VLIW to C) compiled cycle accurate instruction set simulation iboy Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 3 / 1
Previous Projects STonX AtariST on X windows generated 68k instruction set emulator work on binary translation unfinished Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 4 / 1
Previous Projects CACAO JIT-only Java Virtual Machine ultra-fast basic compiler recompilation with optimizations on-stack replacement deoptimization when assumptions become invalid Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 5 / 1
Previous Projects bintrans generator for user mode binary translators LISP like source and target architecture specification direct translation of blocks/traces without intermediate representation hybrid fixed register mapping and local register allocation local register liveness analysis with global propagation between different runs 1.8 to 2.5 overhead compared to native code Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 6 / 1
Previous Projects iboy gameboy emulator for ipod full system level cycle accurate emulation uses only dynamic binary translation self modifying code leads to recompilation of basic block template based generated compiler local flag constant propagation and liveness analysis ROM/RAM code caches Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 7 / 1
Overview Simulator Specification Processor Architecture Description processor.xml adlgen tool VHDL Generator Common xadl Compiler Generator Processing Simulator Generator VHDL Hardware Compiler decoder.c simulator.c Preprocessed Description Simulator Generator Instruction 1 Pipeline Stage 1... Instruction 2 Pipeline Stage 1 Pipeline Stage 2 Pipeline Stage 3 Pipeline Stage 4 Instruction N Pipeline Stage 1.... read input port a read input port b read input port c...... add a, b... Memory-Element Pool a b Sequence Element Sequence Element...... Sequence Element... C Variable Definitions Definition of a Definition of b... C Function Instruction 2, Stage 3 C Statement Variable Usage C Statement Variable Usage C Statement Variable Usage... Instruction Set List of Built-in Calls Internal Description C Simulator Source Build Description Emit Code Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 8 / 1
Simulator Specification Architecture Description Language mostly structural xml syntax graphical user interface available no redundant information MIPS R2000 specification is about 1000 lines Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 9 / 1
Simulator Specification Architecture Description Example <Operation name="addu" syntax="op3 s" > <Syntax syntax="op3 s" token="op" value="addu" /> <Body> <add a="rs i" b="rt i" d="rd o" o="overflow" c="carry"/> </Body> </Operation> Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 10 / 1
Simulator Implementation Simulator Basics generated mixed interpreting/translating simulator translation of blocks and traces common IR for interpreter and translator backend is LLVM just-in-time compiler own and LLVM optimizations used Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 11 / 1
Simulator Implementation Differences to Standard Binary Translation cycle accurate simulator full system simulation simulation of in-order pipelined architectures instructions cross basic block borders Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 12 / 1
Simulator Implementation Overlapping Instructions stage number 1 2 3 4 block A block B cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 A4 inst 1 jump inst 3 B1 B2 B3 B4 A3 A4 inst 1 jump inst 3 B1 B2 B3 A2 A3 A4 inst 1 jump inst 3 B1 B2 A1 A2 A3 A4 inst 1 jump inst 3 B1 instructions executed in block A splitted instruction 1 splitted jump instruction 2 (modifiy PC in stage 2) splitted instruction 3 instructions executed in block B Instructions which depend on the predecessor basic block Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 13 / 1
Simulator Implementation Similar Predecessor Blocks program memory 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 0x105: store 0x106: add basic block 1 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 basic block 3 0x102 0x103 0x104 0x105: store 0x106: add basic block 2 0x102 0x103 0x104 0x102: store 0x103: sub 0x104: cbranch 0x101 Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 14 / 1
Simulator Implementation Basic Block Duplication program memory 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 0x105: store 0x106: add basic block 1 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 basic block 3a 0x101 0x102 0x103 0x104 0x105: store basic block 2a 0x101 0x102 0x103 0x104 0x102: store 0x103: sub 0x104: cbranch 0x101 basic block 3b 0x104 0x102 0x103 0x104 0x105: store basic block 2b 0x104 0x102 0x103 0x104 0x102: store 0x103: sub 0x104: cbranch 0x101 0x106: add 0x106: add Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 15 / 1
Trace Formation Simulator Implementation entry point of trace Trace 1 basic block 1 bb1 pipeline restore info successor array 0 1 bb2 bb3 basic block 2 bb2 pipeline restore info basic block 3 successor array 0 1 bb4 bb2 basic block 4 bb4 pipeline restore info successor array bb3 pipeline restore info successor array 0 bb4 exit point for basic block 1 exit points for basic blocks 2 to 4 Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 16 / 1
Simulator Implementation Optimizations LLVM optimizations (e.g. constant propagation, dead store elimination) local copy of global values linking of basic blocks constant forward optimization LLVM JIT very slow (mostly instruction selection) Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 17 / 1
Empirical Evaluation Simulation Speed MIPS 1000.0MHz 100.0MHz 21.9MHz 10.9MHz 10.0MHz 195.1MHz 111.2MHz 66.0MHz 10.4MHz 4.9MHz 43.8MHz 10.2MHz Translator Interpreter 4.2MHz 43.6MHz 1.0MHz 1.0MHz prime jpeg crc32 sha dijkstra bitcount blowfishstringsearch adpcm gsm rijndael AVG Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 18 / 1
Empirical Evaluation Simulation Speed CHILI 1000.0Mhz 100.0Mhz 77.5Mhz 482.4Mhz 99.5Mhz 111.8Mhz 73.4Mhz Translator Interpreter 79.6Mhz 10.0Mhz 3.9Mhz 8.2Mhz 3.4Mhz 10.5Mhz 4.1Mhz 1.0Mhz 0.4Mhz prime jpeg crc32 sha dijkstra bitcount blowfishstringsearch adpcm gsm rijndael AVG Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 19 / 1
Empirical Evaluation Breakdown of Simulation Cycles MIPS 100% 80% 60% 40% 20% 0% Trace BasicBlock Interpreter prime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 20 / 1
Empirical Evaluation Breakdown of Simulation Cycles CHILI 100% 80% 60% 40% 20% 0% Trace BasicBlock Interpreter prime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 21 / 1
Empirical Evaluation Compile and overall Run Time MIPS 16s 14s 12s 10s 8s 6s 4s 2s 0s prime Compiler Runtime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 22 / 1
Empirical Evaluation Compile and overall Run Time CHILI 16s 14s 12s 10s 8s 6s 4s 2s 0s prime Compiler Runtime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 23 / 1
Empirical Evaluation Performance with increasing Simulation Time CHILI 1000MHz 100MHz 10MHz jpeg crc32 blowfish 1MHz 10k cycles 100k cycles 1M cylces 10M cycles 100M cycles 1000M cycles 10000M cycles Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 24 / 1
Empirical Evaluation Simulation Speed with different Compilation Thresholds Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 25 / 1
Empirical Evaluation Optimized Block Linkage MIPS 250% 200% 150% 100% Trace Block Link 50% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 26 / 1
Empirical Evaluation Optimized Block Linkage CHILI 300% 250% 200% 150% Trace Block Link 100% 50% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 27 / 1
Empirical Evaluation Local Copy of Global Values MIPS 450% 400% 393% 350% 354% 300% 250% 200% 150% 100% 173% 229% 123% 125% 152% 174% 241% 140% 125% 109% 147% 170% 181% 209% 130% 133% 192% 119% 83% 142% Local Scope Opt. Local Scope 50% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 28 / 1
Empirical Evaluation Local Copy of Global Values CHILI 600% 500% 521% 400% 300% 312% 351% 250% 253% Local Scope Opt. Local Scope 200% 100% 145% 133% 125% 176% 159% 142% 99% 100% 123% 139% 153% 123% 152% 163% 123% 94% 170% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 29 / 1
Empirical Evaluation Forwarding Optimization MIPS 700% 600% 611% 500% 400% 401% Constant Forwards 300% 284% 228% 200% 100% 138% 182% 160% 166% 161% 111% 146% 146% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael AVG Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 30 / 1
Conclusion and Further Work Conclusion cycle accurate simulation rises additional problems binary translation is efficient LLVM JIT is slow Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 31 / 1