Dynamic Translation for EPIC Architectures

Similar documents
Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor

Case Study : Transmeta s Crusoe

Crusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor

Virtual Machines and Dynamic Translation: Implementing ISAs in Software

Advanced Computer Architecture

History of the Intel 80x86

IA-64, P4 HT and Crusoe Architectures Ch 15

Review Questions. 1 The DRAM problem [5 points] Suggest a solution. 2 Big versus Little Endian Addressing [5 points]

Practical Malware Analysis

RAID 0 (non-redundant) RAID Types 4/25/2011

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Outline. What Makes a Good ISA? Programmability. Implementability

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

Getting CPI under 1: Outline

Outline. What Makes a Good ISA? Programmability. Implementability. Programmability Easy to express programs efficiently?

Communicating with People (2.8)

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

CS 31: Intro to Systems ISAs and Assembly. Martin Gagné Swarthmore College February 7, 2017

RISC I from Berkeley. 44k Transistors 1Mhz 77mm^2

Using MMX Instructions to Perform Simple Vector Operations

Digital Forensics Lecture 3 - Reverse Engineering

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions

Computer System Architecture

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Superscalar Processors

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

VLIW/EPIC: Statically Scheduled ILP

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College September 25, 2018

2.7 Supporting Procedures in hardware. Why procedures or functions? Procedure calls

How to write powerful parallel Applications

Computer Organization & Assembly Language Programming

Where we are. Instruction selection. Abstract Assembly. CS 4120 Introduction to Compilers

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

Systems Architecture I

Dynamic Control Hazard Avoidance

Computer Science 160 Translation of Programming Languages

CS 31: Intro to Systems ISAs and Assembly. Kevin Webb Swarthmore College February 9, 2016

ECE 486/586. Computer Architecture. Lecture # 7

administrivia final hour exam next Wednesday covers assembly language like hw and worksheets

Pentium IV-XEON. Computer architectures M

Homework. Reading. Machine Projects. Labs. Exam Next Class. None (Finish all previous reading assignments) Continue with MP5

Memory Models. Registers

CNIT 127: Exploit Development. Ch 3: Shellcode. Updated

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Assembly Language. Lecture 2 x86 Processor Architecture

CS Bootcamp x86-64 Autumn 2015

Assembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Computer Science 146. Computer Architecture

The Geometry of Innocent Flesh on the Bone

Low-Level Essentials for Understanding Security Problems Aurélien Francillon

Program Exploitation Intro

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

Instruction Set Principles. (Appendix B)

Chapter 2. lw $s1,100($s2) $s1 = Memory[$s2+100] sw $s1,100($s2) Memory[$s2+100] = $s1

CS412/CS413. Introduction to Compilers Tim Teitelbaum. Lecture 21: Generating Pentium Code 10 March 08

IA-32 Architecture COE 205. Computer Organization and Assembly Language. Computer Engineering Department

Instruction Set Architecture

Instruction Set Architecture

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

Lecture 27. Pros and Cons of Pointers. Basics Design Options Pointer Analysis Algorithms Pointer Analysis Using BDDs Probabilistic Pointer Analysis

Using MMX Instructions to Implement 2D Sprite Overlay

A Real System Evaluation of Hardware Atomicity for Software Speculation

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

Lecture 20 Pointer Analysis

Multiple Issue ILP Processors. Summary of discussions

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

Instruction Set Architecture

CISC 360 Instruction Set Architecture

Advanced Computer Architecture

Credits and Disclaimers

Compiler construction. x86 architecture. This lecture. Lecture 6: Code generation for x86. x86: assembly for a real machine.

Second Part of the Course

INSTRUCTION LEVEL PARALLELISM

Introduction to Machine/Assembler Language

CS 152 Computer Architecture and Engineering. Lecture 22: Virtual Machines

DETECTION OF SOFTWARE DATA DEPENDENCY IN SUPERSCALAR COMPUTER ARCHITECTURE EXECUTION. Elena ZAHARIEVA-STOYANOVA, Lorentz JÄNTSCHI

Code Optimization April 6, 2000

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture

CS425 Computer Systems Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture

IA-64 Architecture Innovations

CS429: Computer Organization and Architecture

Towards the Hardware"

CS 252 Graduate Computer Architecture. Lecture 15: Virtual Machines

For our next chapter, we will discuss the emulation process which is an integral part of virtual machines.

Lecture 14 Pointer Analysis

Understand the factors involved in instruction set

Credits and Disclaimers

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

W4118: PC Hardware and x86. Junfeng Yang

x86 assembly CS449 Fall 2017

COMPUTER ORGANIZATION AND DESI

The Processor: Instruction-Level Parallelism

ECE 486/586. Computer Architecture. Lecture # 8

CS61 Section Solutions 3

Transcription:

Dynamic Translation for EPIC Architectures David R. Ditzel Chief Architect for Hybrid Computing, VP IAG Intel Corporation Presentation for 8 th Workshop on EPIC Architectures April 24, 2010 1 Dynamic Translation for EPIC 1 CGO 2010

Thesis: The future of computing belongs to EPIC Architectures EPIC: Explicitly Parallel Instruction Computer or Exposed Parallelism Instruction Computer Parallelism exposed for software to exploit Examples Itanium, GPGPU s, Transmeta Efficeon/Crusoe My belief: EPIC is a more power efficient approach Dynamic translation will improve power advantages May be a different EPIC than we know today Dynamic Translation for EPIC 2 CGO 2010

Biggest challenge Power is the limiter We must move to more efficient computing structures or # cores could be limited Dynamic Translation for EPIC 3 CGO 2010

Simple Power Scaling Example Power = Cdyn x Voltage 2 x Frequency + Leakage (33%) Moore s Law says # devices can double every node 4 cores go to 128 cores over 10 years How does power limit this expectation? With an upper power limit of ~100 Watts, how many cores? Easy to calculate scaling per node: Voltage scaling about 0.9x Cdyn scaling about 0.8x Assume frequency increase of 1.2x From this data we can see how many cores we can have if we do not change to a more efficient approach Dynamic Translation for EPIC 4 CGO 2010

Power Limits # of Big Cores Year 2008 2010 2012 2014 2016 2018 Technology Node (nm) 45 32 22 15 11 8 Total Power 100 Power/core 25 Freq 3.0 Voltage 1.0 Cdyn/Core 5.6 Expected #Cores 4 8 16 32 64 128 Power Limited #Cores 4 Dynamic Translation for EPIC 5 CGO 2010

Power Limits # of Big Cores Year 2008 2010 2012 2014 2016 2018 Technology Node (nm) 45 32 22 15 11 8 Total Power 100 100 100 100 100 100 Power/core 25 Freq 3.0 3.6 4.3 5.2 6.2 7.5 Voltage 1.0 0.9 0.8 0.7 0.7 0.6 Cdyn/Core 5.6 4.4 3.6 2.8 2.3 1.8 Expected #Cores 4 8 16 32 64 128 Power Limited #Cores 4 Dynamic Translation for EPIC 6 CGO 2010

Power Limits # of Big Cores Year 2008 2010 2012 2014 2016 2018 Technology Node (nm) 45 32 22 15 11 8 Total Power 100 100 100 100 100 100 Power/core 25 19 15 12 9 7 Freq 3.0 3.6 4.3 5.2 6.2 7.5 Voltage 1.0 0.9 0.8 0.7 0.7 0.6 Cdyn/Core 5.6 4.4 3.6 2.8 2.3 1.8 Expected #Cores 4 8 16 32 64 128 Power Limited #Cores 4 Dynamic Translation for EPIC 7 CGO 2010

Power Limits # of Big Cores Year 2008 2010 2012 2014 2016 2018 Technology Node (nm) 45 32 22 15 11 8 Total Power 100 100 100 100 100 100 Power/core 25 19 15 12 9 7 Freq 3.0 3.6 4.3 5.2 6.2 7.5 Voltage 1.0 0.9 0.8 0.7 0.7 0.6 Cdyn/Core 5.6 4.4 3.6 2.8 2.3 1.8 Expected #Cores 4 8 16 32 64 128 Power Limited #Cores 4 5 7 9 11 14 Dynamic Translation for EPIC 8 CGO 2010

Power Limits # of Big Cores Year 2008 2010 2012 2014 2016 2018 Technology Node (nm) 45 32 22 15 11 8 Total Power 100 100 100 100 100 100 Power/core 25 19 15 12 9 7 Freq 3.0 3.6 4.3 5.2 6.2 7.5 Voltage 1.0 0.9 0.8 0.7 0.7 0.6 Cdyn/Core 5.6 4.4 3.6 2.8 2.3 1.8 Expected #Cores 4 8 16 32 64 128 Power Limited #Cores 4 5 7 9 11 14 We need to improve the efficiency of each core or we will suffer severe performance reduction Dynamic Translation for EPIC 9 CGO 2010

So how do we build improved cores? Dynamic Translation for EPIC 10 CGO 2010

Premise Change of perspective needed Software should be part of the picture Hardware co-designed with software increases the available options Software needs a simple model of the cost of an instruction Out-of-order processors made this impossible In-order EPIC processor can provide this simple model Software can do a very good job of scheduling, but only if the scheduling blocks are large enough Let s look at an example of how to increase block size and improve scheduling Dynamic Translation for EPIC 11 CGO 2010

Compiler optimization example Conditional branches tend to have a very biased program behavior Exploitable by compiler Correctness makes it difficult Fixup code for cold exits Exceptions A little special purpose hardware can make it much easier tst.ne p1, ecx, ecx brc assert ~p1 p1, D or eax, zero, 1 ld r32, edx, [esp + 112] or ebx, zero, 0 st ebx, [r32] ld esi, [ebp + 0x878] cmp.ne p1 edi, 72 brc assert ~p1 p1, E or eax, zero, 1 ld ebx, [ebp] ld ebx, [ebx + esi*4] ld edx, [esp + 112] st ebx, [edx] tst.ne p1, ecx, ecx brc p1, F Dynamic Translation for EPIC 12 CGO 2010

Hardware atomicity Hardware executes a region of code completely or not at all Common case is fast Uncommon case rolls back Resume in non-specialized code tst.ne p1, ecx, ecx brc assert ~p1 p1, D or eax, zero, 1 ld r32, edx, [esp + 112] or ebx, zero, 0 st ebx, [r32] ld esi, [ebp + 0x878] cmp.ne p1 edi, 72 brc assert ~p1 p1, E or eax, zero, 1 ld ebx, [ebp] ld ebx, [ebx + esi*4] ld edx, [esp + 112] st ebx, [edx] tst.ne p1, ecx, ecx brc p1, F Dynamic Translation for EPIC 13 CGO 2010

Interpreter Runtime Translations Dynamic binary translation test jne ecx, ecx D tst.ne p1, ecx, ecx assert brc ~p1 p1, D x86 Applications x86 OS mov eax, 1 mov esi, [esp + 112] xor ebx,ebx mov [esi], ebx mov esi, [ebp + 0x878] cmp edi, 72 jne E or eax, zero, 1 ld r32, edx, [esp + 112] or ebx, zero, 0 st ebx, [r32] ld esi, [ebp + 0x878] cmp.ne p1 edi, 72 assert brc ~p1 p1, E x86 ISA Code Morphing Software RISC ISA x86 processor EPIC Processor mov eax, 1 mov ebx, [ebp] mov ebp, [ebx + esi*4] mov edx, [esp + 112] mov [edx], ebx test ecx,ecx jne F or eax, zero, 1 ld ebx, [ebp] ld ebx, [ebx + esi*4] or edx, r32, 0 st ebx, [edx] tst.ne p1, ecx, ecx brc p1, F Dynamic Translation for EPIC 14 CGO 2010

Efficeon Processor Example Up to 6-issue/clock EPIC style architecture 2 loads or stores 2 integer ALU 2 SIMD 1 branch/call or other control Co-designed with CMS Includes hardware atomicity under software control Commit Rollback Dynamic Translation for EPIC 15 CGO 2010

Efficeon Hardware Example Each clock, processor can issue from one to six 32-bit instruction atoms to 11 functional units Instruction atom1 atom2 atom3 atom4 atom5 atom6 atom7 atom8 Load or Store or 32-bit add Load or Store or 32-bit add Integer ALU-1 Integer ALU-2 Alias Control Functional Units FP / SIMD FP / SIMD Branch Exec-1 Exec-2 Dynamic Translation for EPIC 16 CGO 2010

Code Morphing Software 4 Gear System Significantly Improved Responsiveness and Overall Performance 1 st Gear Executes 1 instruction at a time Profiles code at runtime Gathers data for flow analysis Gathers branch frequencies and directions Detects load/store typing (IO vs memory) Filters out infrequently executed code No startup cost Lowest speed Dynamic Translation for EPIC 17 CGO 2010

Code Morphing Software 4 Gear System Significantly Improved Responsiveness and Overall Performance 1 st Gear 2 nd Gear Uses profile data to create initial translations after code reaches 1 st threshold. Translates a Region of up to100 x86 instructions. Adds flow graph Shape information Light Optimization Greedy scheduling Low translation overhead Fast execution Dynamic Translation for EPIC 18 CGO 2010

Code Morphing Software 4 Gear System Significantly Improved Responsiveness and Overall Performance 1 st Gear 2 nd Gear 3 rd Gear Further optimizes the 2 nd gear regions Common sub-expression elimination Memory re-ordering Significant code optimization Critical path scheduling Medium translation overhead Faster execution Dynamic Translation for EPIC 19 CGO 2010

Code Morphing Software 4 Gear System Significantly Improved Responsiveness and Overall Performance 1 st Gear 2 nd Gear 3 rd Gear 4 th Gear Most advanced optimizations for hottest code regions. Splices together multiple regions Optimizes across region boundaries Used advanced behavioral data Critical path scheduling Highest translation overhead Fastest execution Dynamic Translation for EPIC 20 CGO 2010

CMS Translator Optimization Sophisticated translation optimizer Quickly applies many optimizations if-conversion, loop unrolling, constant folding and propagation, common-subexpression elimination, deadcode-elimination, loop-invariant code motion, superblock scheduling New optimizations must have low overhead Dynamic Translation for EPIC 21 CGO 2010

Example from Vortex Dynamic opportunities in CMS translation Hottest method in Vortex: OaGetObject Potential to eliminate x86 insts 56 -> 47 dynamic x86 insts 19% reduction same condition dead store partially redundant load killed by partially dead op store->load forward CMS relies on superblock abstraction Does not expose available opportunity partially redundant load partially redundant load Dynamic Translation for EPIC 22 CGO 2010

Atomic region abstraction Atomic regions trivially expose opportunity Convert biased control flow into assert operations Represent as dataflow op in IR Traditional optimizations can now exploit speculative opportunity Emit as conditional branch to jump to rollback and recover Retry in the interpreter or another translation Dynamic Translation for EPIC 23 CGO 2010

MEMORY MEMORY INT ALU INT ALU BRANCH Atomic region benefits Dependence graph shown Atomic region trivially enables optimizer to eliminate operations 88 -> 73 Efficeon operations 17% reduction Relaxes scheduling constraints 26 -> 19 cycles 27% reduction Dynamic Translation for EPIC 24 CGO 2010

Why we need EPIC architectures EPIC architectures offer many advantages Simplified hardware Simpler to design Smaller cores means more cores per die Enables software scheduling EPIC architectures are easier for DBT to schedule Better scheduling is the key to future performance gains Power In-order pipelines for EPIC are power efficient Less hardware for OOO means lower power More amenable to new power saving techniques Dynamic Translation for EPIC 25 CGO 2010

Why we need Dynamic Translation Good reasons for Dynamic Binary Translation (DBT) Innovation To allow processor innovation not tied to particular instruction sets Using DBT to provide backwards compatibility DBT system hidden from standard software CMS as microcode Performance To enable new means to improve processor performance DBT can provide access to new performance features Power Dynamic optimization is good for power e.g., optimizing away half the instructions is twice as energy efficient Dynamic Translation for EPIC 26 CGO 2010

Conclusions Binary Translation with EPIC architectures are a good combination. Special purpose hardware support is needed, co-designed with software, in order to provide good performance and power efficiency. Special care is needed to keep translation overhead low. Many opportunities for clever hardware/software co-design tradeoffs This is a technological approach still in its infancy Prediction: Dynamic Binary Translation will become a basic technique used in future processor design, as integral as logic gates and microcode are today. Dynamic Translation for EPIC 27 CGO 2010

END OF SLIDES