
Thinking Outside the Box: The Transmeta Crusoe Processor
55:132/22C:160 High Performance Computer Architecture

Reference
"The Technology Behind Crusoe Processors: Low-power x86-Compatible Processors Implemented with Code Morphing Software," Alexander Klaiber, Transmeta Corporation, January 2000.
http://transmeta.com/pdfs/paper_aklaiber_19jan00.pdf

What is Binary Translation?

Taking a binary executable in a source ISA and generating a new executable in a target ISA, such that the new executable has exactly the same functional behavior as the original.

Same ISA
- Optimization: compiler instruction scheduling is a restricted form of translation
- Re-optimizing old binaries for new, but ISA-compatible, hardware
- Re-optimization can improve performance regardless of whether implementation details are exposed by the ISA

Across ISAs
- Overcoming binary compatibility: two processors are binary compatible if they can run the same set of binaries (from BIOS to OS to applications)
- Strong economic incentive:
  - How do I get all of the popular software to run on my new processor?
  - How do I get my software to run on all of the popular processors?

What is so hard about it?

- It is always possible to interpret an executable from any ISA on a machine of any ISA
- But naive interpreters incur a lot of overhead, and thus run slower and use more memory
- Binary translation is not interpretation: it emits new binaries that run natively on the target ISA processor
- Translation can be very difficult if the source ISA (e.g., x86) or the source executable (e.g., hand-crafted assembly code) is not "nice"
- Without the high-level source code, you can't always statically tell what an executable is going to do
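To make the interpretation overhead concrete, here is a minimal sketch of an interpreter for a hypothetical toy ISA (not x86 and not Crusoe's): every executed instruction pays a fetch/decode/dispatch cost in software, which is exactly the cost that translating to native code amortizes away.

```python
# Minimal sketch: interpreting a toy 3-operand ISA (hypothetical, for
# illustration only). Every executed instruction pays fetch/decode/dispatch
# overhead -- the cost a binary translator removes by emitting native code.

def interpret(program, regs):
    """Execute `program` (a list of (op, dst, src1, src2) tuples) once."""
    pc = 0
    while pc < len(program):
        op, dst, a, b = program[pc]          # fetch + decode on EVERY step
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        elif op == "movi":                   # move immediate into dst
            regs[dst] = a
        else:
            raise ValueError(f"unknown opcode {op}")
        pc += 1
    return regs

regs = {"r0": 0, "r1": 0, "r2": 0}
prog = [
    ("movi", "r0", 5, None),
    ("movi", "r1", 7, None),
    ("add",  "r2", "r0", "r1"),
]
print(interpret(prog, regs)["r2"])   # -> 12
```

A translator would instead walk `prog` once and emit equivalent native instructions, so repeated executions skip the dispatch loop entirely.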

Some Hard Problems in Translation

Day-to-day problems:
- Floating-point representation and operations
- Precise exceptions and interrupts

More obscure problems (executables compiled from high-level languages tend not to have these kinds of problems):
- Self-modifying code: a program can construct an instruction (in the old format) as a data word, store it to memory, and jump to it
- Self-referential code: a program can checksum its code segment and compare it against a stored value based on the original executable
- Register-indirect jumps to computed addresses: a program might compute a jump target that is only appropriate for the original binary format and layout; a program can even jump to the middle of an instruction on purpose

Static vs. Dynamic Translation

Static
+ May have source information (or at least the object code)
+ Can spend as much time as you need (days to months)
- Isn't always safe or possible
- Not transparent to users

Dynamic
- Translation time is part of program execution time
  - Can't do very complex analysis/optimization
  - Infrequently used code sections cost as much to translate as frequently used code sections
- No source-level information
+ Has runtime information (dynamic profiling and optimization)
+ Can fall back to interpretation if all else fails
+ Can be completely transparent to users

How can binary translation be used?

- Porting old software to new platforms (static, different-ISA)
  e.g., DEC's translator from VAX to Alpha
- Binary augmentation (static, same-ISA): localized modifications to shrink-wrap binaries without sources
  e.g., inserting profiling code, simple optimizations
- Dynamic code optimization (dynamic, same-ISA): profile an execution and dynamically modify the executable using techniques such as trace scheduling
  e.g., HP Dynamo
- Cross-platform execution (dynamic, different-ISA): using a combination of interpretation and translation to very efficiently emulate a different (often nasty) ISA
  e.g., Transmeta Crusoe and Code Morphing
- Efficient virtual machines (dynamic, different-ISA): using a combination of interpretation and translation to very efficiently emulate a different (nice-by-design) ISA
  e.g., Java virtual machines and JIT (just-in-time) compilation

A New Way to Think about Architecture

Architecture = dynamic translation + hardware implementation
- No problem of forward or backward binary compatibility
  - Backward-compatible processor: don't need new software
  - Forward-compatible processor: don't need new processors
- Don't need increasingly fancy hardware to speed up an old ISA
- Both the translator and the hardware can be upgraded or repaired with very little disruption to the users
- Processors (and systems) become commodity items (like DRAM)
  - Processors can become very simple but very fast
  - Slightly defective processors can still be sold with workarounds
- Old platforms and software can be cost-effectively revived and maintained forever

Transmeta Crusoe & Code Morphing

Complete abstraction:
- The x86 software stack (applications, OS, BIOS) runs on top of Code Morphing (dynamic binary translation), which runs on the native Crusoe VLIW hardware
- Crusoe boots the Code Morpher from ROM at power-up
- Crusoe + Code Morphing == x86 processor: x86 software (including the BIOS) cannot tell the difference

Crusoe VLIW Processor

- A 128-bit "molecule" bundles atoms for the function units, e.g. { FADD ; ADD ; LD ; BRCC }
- 64- or 128-bit molecules directly control the in-order VLIW pipeline (no dependences within a molecule)
- 1 FPU (10-stage), 2 ALUs (7-stage), 1 load/store unit, and 1 branch unit
- 64 integer GPRs and 32 FPRs, plus shadow registers
- No hardware renaming or reordering
- Same condition-code, floating-point, and TLB formats as x86

Executing x86: as uops or as atoms

- Pentium 4 (in hardware): translate x86 to uops, hold them in a trace cache, dispatch out of order to parallel function units, retire in order
- Crusoe (in software): Code Morphing software (translate & interpret) fills a translation cache; the VLIW hardware simply dispatches molecules in order to the VLIW function units, with check-point/restore via shadow registers for recovery
- The register file holds the x86 state plus temporary registers reserved for the Code Morphing software and translated code
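The translate-and-cache flow that Code Morphing performs in software can be sketched as a counter-driven translation cache. The ~50-execution retranslation threshold comes from later in this handout; the block representation, the "translated code" strings, and the class itself are hypothetical stand-ins, not real CMS internals.

```python
# Sketch of the CMS control loop: interpret cold x86 blocks, count their
# executions, and translate a block to native VLIW code once it gets hot
# (the handout cites ~50 executions). All data shapes here are illustrative.

HOT_THRESHOLD = 50

class TranslationCache:
    def __init__(self):
        self.counts = {}        # block address -> execution count
        self.cache = {}         # block address -> "translated" code

    def execute(self, addr, block):
        if addr in self.cache:
            return f"native:{self.cache[addr]}"       # run cached translation
        self.counts[addr] = self.counts.get(addr, 0) + 1
        if self.counts[addr] >= HOT_THRESHOLD:
            self.cache[addr] = f"vliw({block})"       # translate + cache
            return f"native:{self.cache[addr]}"
        return f"interpreted:{block}"                 # stay in the interpreter

tc = TranslationCache()
results = [tc.execute(0x1000, "addl;addl;movl;subl") for _ in range(60)]
print(results[0].split(":")[0], results[-1].split(":")[0])   # -> interpreted native
```

The amortization argument from the "Cost of Translation" slide falls out directly: the per-block translation cost is paid once, while every subsequent execution runs out of the cache.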

Code Morphing Software (CMS)

The only software written natively for Crusoe processors.
- Begins execution at power-up
- Fetches a previously unseen x86 basic block from memory
- Translates a block of x86 instructions at a time into Crusoe VLIW code
- Caches the translation for future use
- Jumps to the generated Crusoe code for execution; execution can continue directly into other blocks if their translations are cached
- Regains control when execution reaches an unknown basic block
- Interprets the execution of unsafe x86 instructions
- Retranslates a block after collecting profiling information

CMS uses a separate region of memory that cannot be touched by code translated from x86.

Crusoe processors do not need to be binary compatible between generations: each generation can make different design trade-offs, but a new processor needs a new translator.

Cost of Translation

Translation time is part of execution time! Translation cost has to be amortized over repeated use.
- 1st-pass translation must be fast and safe, almost like interpretation
  - x86 instructions are examined and translated byte by byte
  - CMS constructs a function that is equivalent to the x86 basic block
  - CMS jumps to the function and regains control when the function returns
  - Collects statistics, i.e., execution frequency and branch histories
- Retranslate an often-repeated basic block (after ~50 executions)
  - Examines the execution profile
  - Applies full-blown analysis and optimization
  - Builds inlined Crusoe code that can run directly out of the translation cache without intervention by CMS
  - Can do cross-basic-block optimizations, such as speculative code motion and trace scheduling
- Caches translations for reuse to amortize translation cost

Example of a Translation

x86 Binary Code
A: addl %eax, (%esp)   // load data from stack, add to %eax
B: addl %ebx, (%esp)   // load data from stack, add to %ebx
C: movl %esi, (%ebp)   // load from mem (%ebp) into %esi
D: subl %ecx, 5        // subtract 5 from %ecx

1st Pass: Sequential Crusoe Atoms (literal translation)
ld    %r30, [%esp]        // A: load data from stack, save to temp
add.c %eax, %eax, %r30    // add to %eax, set condition codes
ld    %r31, [%esp]        // B: load data from stack, save to temp
add.c %ebx, %ebx, %r31    // add to %ebx, set condition codes
ld    %esi, [%ebp]        // C: load from mem (%ebp) into %esi
sub.c %ecx, %ecx, 5       // D: subtract 5 from %ecx

Example of an Optimization

1st Pass: Sequential Crusoe Atoms
ld    %r30, [%esp]
add.c %eax, %eax, %r30    // cc is never tested
ld    %r31, [%esp]        // %r31 and %r30 are a common subexpression
add.c %ebx, %ebx, %r31    // cc is never tested
ld    %esi, [%ebp]
sub.c %ecx, %ecx, 5

2nd Pass: Optimized Crusoe Atoms (basic optimizations)
ld    %r30, [%esp]        // [%esp] is loaded once and reused
add   %eax, %eax, %r30    // no need to set condition codes
add   %ebx, %ebx, %r30    // no need to set condition codes
ld    %esi, [%ebp]
sub.c %ecx, %ecx, 5

Optimizations include common-subexpression elimination, dead-code elimination (including unnecessary condition-code updates), loop-invariant removal, etc. (see L16 for more)
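The 2nd-pass rewrite above — reusing the [%esp] load and dropping condition-code updates that nobody tests — can be sketched as a small pass over a simplified atom list. The (op, dst, src) encoding is a hypothetical simplification of the three-operand atoms, intervening stores are ignored for brevity (real CMS guards that case with ldp/stam, described later in the handout), and the cc_readers parameter is an illustrative stand-in for real liveness analysis.

```python
# Sketch of two of the optimizations shown above: load CSE (reuse a value
# already loaded from the same address) and dead condition-code elimination
# (drop the ".c" suffix from atoms whose flags are never read). The atom
# format (op, dst, src) is a simplification, not Crusoe's real encoding.

def optimize(atoms, cc_readers=()):
    out, loaded = [], {}          # loaded: address -> register holding it
    subst = {}                    # register renaming introduced by load CSE
    for op, dst, src in atoms:
        src = subst.get(src, src)
        if op == "ld":
            if src in loaded:                 # same address already loaded:
                subst[dst] = loaded[src]      # reuse earlier temp, drop load
                continue
            loaded[src] = dst
        if op.endswith(".c") and op not in cc_readers:
            op = op[:-2]                      # flags never tested: drop ".c"
        out.append((op, dst, src))
    return out

first_pass = [
    ("ld",    "r30", "[%esp]"),
    ("add.c", "eax", "r30"),
    ("ld",    "r31", "[%esp]"),
    ("add.c", "ebx", "r31"),
    ("ld",    "esi", "[%ebp]"),
    ("sub.c", "ecx", "5"),       # kept as .c: later code may test its flags
]
for atom in optimize(first_pass, cc_readers=("sub.c",)):
    print(atom)
```

Running the pass reproduces the 2nd-pass atom list from the slide: the second load of [%esp] disappears, both adds lose their condition-code side effect, and the subtract keeps it.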

Example of Scheduling

2nd Pass: Optimized Crusoe Atoms
ld    %r30, [%esp]
add   %eax, %eax, %r30
add   %ebx, %ebx, %r30
ld    %esi, [%ebp]
sub.c %ecx, %ecx, 5

Final Pass: Scheduled Crusoe Molecules (VLIW scheduling)
{ ld %r30, [%esp] ; sub.c %ecx, %ecx, 5 }
{ ld %esi, [%ebp] ; add %eax, %eax, %r30 ; add %ebx, %ebx, %r30 }

In-order execution of scheduled molecules on a Crusoe processor mimics the dynamic superscalar execution of uops in a Pentium.

Branch Prediction

- Static prediction based on dynamic profiling: translation can favor the more frequently traversed arm of an if-then-else statement by making that arm the fall-through (not-taken) path
- Trace scheduling: construct traces such that the most frequently traversed control-flow paths encounter no branches at all
  - Enlarges the scope of ILP scheduling
  - Needs compensation code when falling off a trace
- Select instruction: SEL CC, Rd, Rs, Rt means if (CC) Rd=Rs else Rd=Rt
  - A limited variant of predicated execution; supports if-conversion, i.e., changing control flow into data flow

Detecting Load/Store Aliasing

Original (potential aliasing):      Reordered (protected):
st  %data, [%y]                     ldp  %r31, [%x]
ld  %r31, [%x]                      stam %data, [%y]
use %r31                            use  %r31

- ldp (load-and-protect) records the location and size of the load
- stam (store-under-alias-mask) checks aliasing against the region protected by ldp
- If stam discovers a conflict, it triggers an exception so CMS can discard the effects of this basic block and rerun a different translation that does not have the load and store reordered

Eliminating Repeated Loads

Original:                           Optimized:
ld  %r30, [%x]                      ldp  %r30, [%x]
st  %data, [%y]                     stam %data, [%y]
ld  %r31, [%x]                      use  %r30
use %r31

- Due to the limited number of x86 ISA registers, programs keep most variables on the stack; the same value is reloaded from the stack for each use (there isn't a spare ISA register to hold it between uses)
- CMS detects repeated loads from the same address as a common subexpression and holds the value in a temporary register for reuse
- A store in between the loads can make the optimization unsafe; stam allows CMS to optimize for the common case
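The ldp/stam protection described above can be modeled in a few lines: ldp records the address range of a hoisted load, and stam faults if a later store overlaps a protected range, letting CMS fall back to a conservative translation. The Python class, the exception, and the 4-byte default access size are illustrative assumptions, not Crusoe's actual encoding.

```python
# Behavioral sketch of ldp (load-and-protect) and stam (store-under-alias-
# mask). On a conflict, real hardware raises an exception and CMS discards
# the block's effects and reruns an unreordered translation; here we just
# raise a Python exception to mark that point.

class AliasFault(Exception):
    pass

class ProtectedMemory:
    def __init__(self, mem):
        self.mem = mem            # address -> value
        self.protected = []       # (start, size) ranges recorded by ldp

    def ldp(self, addr, size=4):
        self.protected.append((addr, size))   # remember location and size
        return self.mem.get(addr, 0)

    def stam(self, addr, value, size=4):
        for start, psize in self.protected:
            if addr < start + psize and start < addr + size:   # ranges overlap?
                raise AliasFault(f"store to {addr:#x} aliases a hoisted load")
        self.mem[addr] = value

mem = ProtectedMemory({0x100: 42})
r31 = mem.ldp(0x100)          # load hoisted above the store, now protected
try:
    mem.stam(0x200, 7)        # disjoint address: store proceeds normally
    mem.stam(0x100, 7)        # aliases the load: fault -> CMS would roll back
except AliasFault as e:
    print("rollback:", e)
```

This is why the reordered and repeated-load optimizations above are safe "for the common case": the fast path assumes no aliasing, and the rare aliasing case is caught at run time instead of being proven impossible at translation time.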

Precise Exception Handling

CMS and Crusoe must emulate x86 behavior exactly, including precise exceptions. But an x86 instruction maps to several atoms, which can be reordered with the atoms of other x86 instructions and dispersed over a large code block after optimization and scheduling.

Solution (assumes exceptions are rare):
- Check-point the machine state at the start of every translated block
- If execution reaches the end of the block without an exception, continue to the next block
- If an exception is triggered in the middle of a block, CMS restores the machine state from the check point and reruns the same block by interpreting the original x86 code, one instruction at a time

Check-Pointing Machine State

Register file:
- A special commit instruction copies the working register contents into the shadow registers
- The shadow registers are not touched by normal program execution
- A restore instruction copies the shadowed values back into the working registers

Gated store buffer:
- All stores are intercepted and held in a special buffer
- After a commit point, all earlier gated stores are released to update the cache or memory as appropriate (note: not all at once!)
- If a restore event is triggered, the contents of the gated store buffer are discarded
- After a commit, any earlier effects cannot be undone; a restore returns the machine state to the last commit point

Performance of Transmeta's x86 Execution

Execution time: comparable to direct hardware implementations by Intel or AMD
- A TM5400 at 667 MHz is about the same as a Pentium III running at 500 MHz
- Unamortized translation cost leads to lower benchmark results

Low cost: much simpler hardware
- The TM5400 is about 7 million transistors (the P4 is at 41 million)
- Easier to design, more scalable, easier to reach high clock rates, more room for caches, better yield, etc.
- Doesn't have to worry about x86 binary compatibility!

Low power
- Less hardware means lower power
- Additional power-management features (such as variable supply voltage and clock frequency)

Crusoe vs. Pentium: Die Size

                   Mobile PII   Mobile PII     Mobile PIII   TM3120   TM5400
Process            .25µm        .25µm shrink   .18µm         .22µm    .18µm
On-chip L1 cache   32KB         32KB           32KB          96KB     128KB
On-chip L2 cache   0            256KB          256KB         0        256KB
Die size           130mm²       180mm²         106mm²        77mm²    73mm²
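The commit/restore scheme above (shadow register file plus gated store buffer) can be sketched as a small behavioral model. This is a model of the mechanism as the handout describes it, not the hardware interface; the class and method names are hypothetical.

```python
# Behavioral sketch of Crusoe's check-pointing: commit snapshots the working
# registers into the shadow copy and releases gated stores to memory in
# order; restore discards buffered stores and rolls the registers back to
# the last commit point.

class Checkpointed:
    def __init__(self):
        self.regs = {}            # working register file
        self.shadow = {}          # shadow copy (state at last commit)
        self.memory = {}          # architected memory
        self.gated = []           # gated store buffer: (addr, value)

    def store(self, addr, value):
        self.gated.append((addr, value))      # held, not yet visible

    def commit(self):
        self.shadow = dict(self.regs)         # snapshot registers
        for addr, value in self.gated:        # release earlier stores in order
            self.memory[addr] = value
        self.gated.clear()

    def restore(self):
        self.regs = dict(self.shadow)         # roll registers back
        self.gated.clear()                    # discard speculative stores

m = Checkpointed()
m.regs["eax"] = 1
m.commit()                  # check point at the start of a translated block
m.regs["eax"] = 99          # speculative work inside the block...
m.store(0x10, 7)
m.restore()                 # exception! undo everything since the commit
print(m.regs["eax"], m.memory.get(0x10))   # -> 1 None
```

After the restore, CMS would re-execute the block by interpreting the original x86 code one instruction at a time, so the exception is delivered at a precise x86 boundary.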

Crusoe vs. Pentium: Heat

[Figure only; image not included in the transcription.]