Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 1 / 1

Similar documents
Fast and Accurate Simulation using the LLVM Compiler Framework

Just-In-Time Compilation

Ultra Fast Cycle-Accurate Compiled Emulation of Inorder Pipelined Architectures

Enhancing Energy Efficiency of Processor-Based Embedded Systems thorough Post-Fabrication ISA Extension

Ultra Fast Cycle-Accurate Compiled Emulation of Inorder Pipelined Architectures

Leveraging Predicated Execution for Multimedia Processing

Energy Consumption Evaluation of an Adaptive Extensible Processor

Just-In-Time Compilers & Runtime Optimizers

SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine p. 1

Exploiting Hardware Resources: Register Assignment across Method Boundaries

Trace Compilation. Christian Wimmer September 2009

Undergraduate Compilers in a Day

BEAMJIT, a Maze of Twisty Little Traces

Evaluating Heuristic Optimization Phase Order Search Algorithms

History of Compilers The term

Topic 9: Control Flow

What do Compilers Produce?

CS 351 Final Exam Solutions

Intermediate Code & Local Optimizations

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors

Is dynamic compilation possible for embedded system?

Intermediate Representations

BEAMJIT: An LLVM based just-in-time compiler for Erlang. Frej Drejhammar

Sista: Improving Cog s JIT performance. Clément Béra

An Investigation into Value Profiling and its Applications

Lecture Compiler Backend

Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler

VM instruction formats. Bytecode translator

Compilers and Interpreters

CS 406/534 Compiler Construction Putting It All Together

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.

Trace-based JIT Compilation

Interaction of JVM with x86, Sparc and MIPS

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

9/5/17. The Design and Implementation of Programming Languages. Compilation. Interpretation. Compilation vs. Interpretation. Hybrid Implementation

A Dynamic Binary Translation System in a Client/Server Environment

point in worrying about performance. The goal of our work is to show that this is not true. This paper is organised as follows. In section 2 we introd

Virtual Machines and Dynamic Translation: Implementing ISAs in Software

Improving Data Access Efficiency by Using Context-Aware Loads and Stores

CS553 Lecture Profile-Guided Optimizations 3

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

In Search of Near-Optimal Optimization Phase Orderings

Automatic Instrumentation of Embedded Software for High Level Hardware/Software Co-Simulation

Model-based Software Development

Running class Timing on Java HotSpot VM, 1

Automatic Instrumentation Technique of Embedded Software for High Level Hardware/Software Co-Simulation

Compiler Design Spring 2018

FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES DEEP: DEPENDENCY ELIMINATION USING EARLY PREDICTIONS LUIS G. PENAGOS

What is a compiler? Xiaokang Qiu Purdue University. August 21, 2017 ECE 573

The basic operations defined on a symbol table include: free to remove all entries and free the storage of a symbol table

When do We Run a Compiler?

Register Allocation. CS 502 Lecture 14 11/25/08

Truffle A language implementation framework

Compiling Techniques

For our next chapter, we will discuss the emulation process which is an integral part of virtual machines.

Outline. Lecture 17: Putting it all together. Example (input program) How to make the computer understand? Example (Output assembly code) Fall 2002

Design & Implementation Overview

Group B Assignment 9. Code generation using DAG. Title of Assignment: Problem Definition: Code generation using DAG / labeled tree.

Compilers and Code Optimization EDOARDO FUSELLA

Compiler construction 2009

JOP: A Java Optimized Processor for Embedded Real-Time Systems. Martin Schoeberl

CSE 501: Compiler Construction. Course outline. Goals for language implementation. Why study compilers? Models of compilation

Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture 1

Where We Are. Lexical Analysis. Syntax Analysis. IR Generation. IR Optimization. Code Generation. Machine Code. Optimization.

LECTURE 2. Compilers and Interpreters

Advanced Computer Architecture

Functions and Procedures

An introduction to DSP s. Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures

INTRODUCTION TO LLVM Bo Wang SA 2016 Fall

Incremental Flow Analysis. Andreas Krall and Thomas Berger. Institut fur Computersprachen. Technische Universitat Wien. Argentinierstrae 8

Instructors: Randy H. Katz David A. PaKerson hkp://inst.eecs.berkeley.edu/~cs61c/fa10. Fall Lecture #11. Agenda

Introduction. CS 2210 Compiler Design Wonsun Ahn

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

Memory Access Optimizations in Instruction-Set Simulators

Lecture 3 Overview of the LLVM Compiler

Crusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor

Compilers. History of Compilers. A compiler allows programmers to ignore the machine-dependent details of programming.

ECE 5775 (Fall 17) High-Level Digital Design Automation. Static Single Assignment

CSE 401/M501 Compilers

LESSON 13: LANGUAGE TRANSLATION

ECE331: Hardware Organization and Design

Kasper Lund, Software engineer at Google. Crankshaft. Turbocharging the next generation of web applications

Apple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple

Application-Specific Design of Low Power Instruction Cache Hierarchy for Embedded Processors

Precept 2: Non-preemptive Scheduler. COS 318: Fall 2018

Incremental Dynamic Code Generation with Trace Trees

Emulation. Michael Jantz

A Feasibility Study for Methods of Effective Memoization Optimization

A Method-Based Ahead-of-Time Compiler For Android Applications

Building a Compiler with. JoeQ. Outline of this lecture. Building a compiler: what pieces we need? AKA, how to solve Homework 2

Accelerating Stateflow With LLVM

Mechanisms and constructs for System Virtualization

Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on

RetDec: An Open-Source Machine-Code Decompiler. Jakub Křoustek Peter Matula

LLVM Tutorial. John Criswell University of Rochester

Java Performance: The Definitive Guide

Hardware Emulation and Virtual Machines

The Software Stack: From Assembly Language to Machine Code

Lecture 2 Overview of the LLVM Compiler

Swift: A Register-based JIT Compiler for Embedded JVMs

Transcription:

Dynamic Binary Translation for Generation of Cycle Accurate Architecture Simulators Institut für Computersprachen Technische Universtät Wien Austria Andreas Fellnhofer Andreas Krall David Riegler Part of this work was supported by OnDemand Microelectronics and the Christian Doppler Forschungsgesellschaft Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 1 / 1

Overview Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 2 / 1

Previous Projects Previous Projects incremental compilers static assembly language instrumentation Molecule (ATOM clone) STonX CACAO bintrans reverse compilation (DSP VLIW to C) compiled cycle accurate instruction set simulation iboy Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 3 / 1

Previous Projects STonX AtariST on X windows generated 68k instruction set emulator work on binary translation unfinished Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 4 / 1

Previous Projects CACAO JIT-only Java Virtual Machine ultra-fast basic compiler recompilation with optimizations on-stack replacement deoptimization when assumptions become invalid Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 5 / 1

Previous Projects bintrans generator for user mode binary translators LISP like source and target architecture specification direct translation of blocks/traces without intermediate representation hybrid fixed register mapping and local register allocation local register liveness analysis with global propagation between different runs 1.8 to 2.5 overhead compared to native code Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 6 / 1

Previous Projects iboy gameboy emulator for ipod full system level cycle accurate emulation uses only dynamic binary translation self modifying code leads to recompilation of basic block template based generated compiler local flag constant propagation and liveness analysis ROM/RAM code caches Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 7 / 1

Overview Simulator Specification Processor Architecture Description processor.xml adlgen tool VHDL Generator Common xadl Compiler Generator Processing Simulator Generator VHDL Hardware Compiler decoder.c simulator.c Preprocessed Description Simulator Generator Instruction 1 Pipeline Stage 1... Instruction 2 Pipeline Stage 1 Pipeline Stage 2 Pipeline Stage 3 Pipeline Stage 4 Instruction N Pipeline Stage 1.... read input port a read input port b read input port c...... add a, b... Memory-Element Pool a b Sequence Element Sequence Element...... Sequence Element... C Variable Definitions Definition of a Definition of b... C Function Instruction 2, Stage 3 C Statement Variable Usage C Statement Variable Usage C Statement Variable Usage... Instruction Set List of Built-in Calls Internal Description C Simulator Source Build Description Emit Code Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 8 / 1

Simulator Specification Architecture Description Language mostly structural xml syntax graphical user interface available no redundant information MIPS R2000 specification is about 1000 lines Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 9 / 1

Simulator Specification Architecture Description Example <Operation name="addu" syntax="op3 s" > <Syntax syntax="op3 s" token="op" value="addu" /> <Body> <add a="rs i" b="rt i" d="rd o" o="overflow" c="carry"/> </Body> </Operation> Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 10 / 1

Simulator Implementation Simulator Basics generated mixed interpreting/translating simulator translation of blocks and traces common IR for interpreter and translator backend is LLVM just-in-time compiler own and LLVM optimizations used Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 11 / 1

Simulator Implementation Differences to Standard Binary Translation cycle accurate simulator full system simulation simulation of in-order pipelined architectures instructions cross basic block borders Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 12 / 1

Simulator Implementation Overlapping Instructions stage number 1 2 3 4 block A block B cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 A4 inst 1 jump inst 3 B1 B2 B3 B4 A3 A4 inst 1 jump inst 3 B1 B2 B3 A2 A3 A4 inst 1 jump inst 3 B1 B2 A1 A2 A3 A4 inst 1 jump inst 3 B1 instructions executed in block A splitted instruction 1 splitted jump instruction 2 (modifiy PC in stage 2) splitted instruction 3 instructions executed in block B Instructions which depend on the predecessor basic block Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 13 / 1

Simulator Implementation Similar Predecessor Blocks program memory 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 0x105: store 0x106: add basic block 1 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 basic block 3 0x102 0x103 0x104 0x105: store 0x106: add basic block 2 0x102 0x103 0x104 0x102: store 0x103: sub 0x104: cbranch 0x101 Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 14 / 1

Simulator Implementation Basic Block Duplication program memory 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 0x105: store 0x106: add basic block 1 0x100: add 0x101: add 0x102: store 0x103: sub 0x104: cbranch 0x101 basic block 3a 0x101 0x102 0x103 0x104 0x105: store basic block 2a 0x101 0x102 0x103 0x104 0x102: store 0x103: sub 0x104: cbranch 0x101 basic block 3b 0x104 0x102 0x103 0x104 0x105: store basic block 2b 0x104 0x102 0x103 0x104 0x102: store 0x103: sub 0x104: cbranch 0x101 0x106: add 0x106: add Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 15 / 1

Trace Formation Simulator Implementation entry point of trace Trace 1 basic block 1 bb1 pipeline restore info successor array 0 1 bb2 bb3 basic block 2 bb2 pipeline restore info basic block 3 successor array 0 1 bb4 bb2 basic block 4 bb4 pipeline restore info successor array bb3 pipeline restore info successor array 0 bb4 exit point for basic block 1 exit points for basic blocks 2 to 4 Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 16 / 1

Simulator Implementation Optimizations LLVM optimizations (e.g. constant propagation, dead store elimination) local copy of global values linking of basic blocks constant forward optimization LLVM JIT very slow (mostly instruction selection) Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 17 / 1

Empirical Evaluation Simulation Speed MIPS 1000.0MHz 100.0MHz 21.9MHz 10.9MHz 10.0MHz 195.1MHz 111.2MHz 66.0MHz 10.4MHz 4.9MHz 43.8MHz 10.2MHz Translator Interpreter 4.2MHz 43.6MHz 1.0MHz 1.0MHz prime jpeg crc32 sha dijkstra bitcount blowfishstringsearch adpcm gsm rijndael AVG Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 18 / 1

Empirical Evaluation Simulation Speed CHILI 1000.0Mhz 100.0Mhz 77.5Mhz 482.4Mhz 99.5Mhz 111.8Mhz 73.4Mhz Translator Interpreter 79.6Mhz 10.0Mhz 3.9Mhz 8.2Mhz 3.4Mhz 10.5Mhz 4.1Mhz 1.0Mhz 0.4Mhz prime jpeg crc32 sha dijkstra bitcount blowfishstringsearch adpcm gsm rijndael AVG Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 19 / 1

Empirical Evaluation Breakdown of Simulation Cycles MIPS 100% 80% 60% 40% 20% 0% Trace BasicBlock Interpreter prime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 20 / 1

Empirical Evaluation Breakdown of Simulation Cycles CHILI 100% 80% 60% 40% 20% 0% Trace BasicBlock Interpreter prime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 21 / 1

Empirical Evaluation Compile and overall Run Time MIPS 16s 14s 12s 10s 8s 6s 4s 2s 0s prime Compiler Runtime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 22 / 1

Empirical Evaluation Compile and overall Run Time CHILI 16s 14s 12s 10s 8s 6s 4s 2s 0s prime Compiler Runtime jpeg crc32 sha bitcount stringsearch gsm dijkstra blowfish adpcm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 23 / 1

Empirical Evaluation Performance with increasing Simulation Time CHILI 1000MHz 100MHz 10MHz jpeg crc32 blowfish 1MHz 10k cycles 100k cycles 1M cylces 10M cycles 100M cycles 1000M cycles 10000M cycles Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 24 / 1

Empirical Evaluation Simulation Speed with different Compilation Thresholds Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 25 / 1

Empirical Evaluation Optimized Block Linkage MIPS 250% 200% 150% 100% Trace Block Link 50% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 26 / 1

Empirical Evaluation Optimized Block Linkage CHILI 300% 250% 200% 150% Trace Block Link 100% 50% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 27 / 1

Empirical Evaluation Local Copy of Global Values MIPS 450% 400% 393% 350% 354% 300% 250% 200% 150% 100% 173% 229% 123% 125% 152% 174% 241% 140% 125% 109% 147% 170% 181% 209% 130% 133% 192% 119% 83% 142% Local Scope Opt. Local Scope 50% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 28 / 1

Empirical Evaluation Local Copy of Global Values CHILI 600% 500% 521% 400% 300% 312% 351% 250% 253% Local Scope Opt. Local Scope 200% 100% 145% 133% 125% 176% 159% 142% 99% 100% 123% 139% 153% 123% 152% 163% 123% 94% 170% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 29 / 1

Empirical Evaluation Forwarding Optimization MIPS 700% 600% 611% 500% 400% 401% Constant Forwards 300% 284% 228% 200% 100% 138% 182% 160% 166% 161% 111% 146% 146% 0% prime jpeg crc32 sha dijkstra bitcount blowfish stringsearch adpcm gsm rijndael AVG Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 30 / 1

Conclusion and Further Work Conclusion cycle accurate simulation rises additional problems binary translation is efficient LLVM JIT is slow Dynamic Binary Translation for Generation of Cycle Accurate October Architecture 28, 2008 Simulators 31 / 1