Several Common Compiler Strategies: Instruction Scheduling, Loop Unrolling, Static Branch Prediction, Software Pipelining



Basic Instruction Scheduling
Reorder the instructions to reduce hazards.
Ex: for (i = 1000; i > 0; i = i-1) x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      Add.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Example: Non-scheduled code and execution

Code:
Loop: L.D    F0, 0(R1)
      Add.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Execution (with stalls):
Loop: L.D    F0, 0(R1)
      stall
      Add.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNE    R1, R2, Loop
      stall

Example (cont'd): Scheduled code and execution

Code:
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      Add.D  F4, F0, F2
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)

Execution:
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      Add.D  F4, F0, F2
      stall
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)

Note that the store offset becomes 8(R1): the DADDUI that decrements R1 by 8 now executes before the store, so the original address 0(old R1) is 8(new R1).
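The cycle counts above can be reproduced with a small in-order issue model. This is a sketch, not a full pipeline simulator: the latency gaps follow the classic Hennessy & Patterson table (load to FP ALU: 1 stall, FP ALU to store: 2 stalls, integer op to branch: 1 stall), and branch-delay slots are not modeled, so the totals count data stalls only.

```python
# gap[(producer_kind, consumer_kind)] = stall cycles required between them
GAP = {
    ("load", "fpalu"): 1,
    ("fpalu", "store"): 2,
    ("fpalu", "fpalu"): 3,
    ("int", "branch"): 1,
}

def cycles(instrs):
    """instrs: list of (kind, dest, srcs); returns issue cycle of the last one."""
    produced = {}          # register -> (producer kind, producer issue cycle)
    cycle = 0
    for kind, dest, srcs in instrs:
        cycle += 1         # next issue slot in program order
        for reg in srcs:   # stall until every source operand is ready
            if reg in produced:
                pkind, pcycle = produced[reg]
                cycle = max(cycle, pcycle + GAP.get((pkind, kind), 0) + 1)
        if dest:
            produced[dest] = (kind, cycle)
    return cycle

unscheduled = [
    ("load",   "F0", ["R1"]),        # L.D    F0, 0(R1)
    ("fpalu",  "F4", ["F0", "F2"]),  # Add.D  F4, F0, F2
    ("store",  None, ["F4", "R1"]),  # S.D    F4, 0(R1)
    ("int",    "R1", ["R1"]),        # DADDUI R1, R1, #-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1, R2, Loop
]
scheduled = [
    ("load",   "F0", ["R1"]),        # L.D    F0, 0(R1)
    ("int",    "R1", ["R1"]),        # DADDUI R1, R1, #-8
    ("fpalu",  "F4", ["F0", "F2"]),  # Add.D  F4, F0, F2
    ("branch", None, ["R1", "R2"]),  # BNE    R1, R2, Loop
    ("store",  None, ["F4", "R1"]),  # S.D    F4, 8(R1)
]

print(cycles(unscheduled), cycles(scheduled))  # 9 vs 6 data-stall cycles
```

The DADDUI and BNE fill slots that would otherwise be stalls, which is exactly what the scheduled version above achieves.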

Loop Unrolling
Unroll the loop body several times to reduce the loop overhead and increase the ILP.
Example: again using the previous loop.

Example

Original:
Loop: L.D    F0, 0(R1)
      Add.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Unrolled four times:
Loop: L.D    F0, 0(R1)
      Add.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      Add.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      Add.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      Add.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

Example (cont'd): the unrolled loop, scheduled:

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      Add.D  F4, F0, F2
      Add.D  F8, F6, F2
      Add.D  F12, F10, F2
      Add.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)

The last two store offsets (16 and 8) are adjusted because the DADDUI that decrements R1 by 32 now executes before them.
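At the source level, the 4x unrolling above corresponds to the following sketch (the real transformation happens on the assembly; this assumes, as the slide's 1000-iteration example does, that the trip count is a multiple of 4):

```python
def saxpy(x, s):
    for i in range(len(x)):
        x[i] = x[i] + s

def saxpy_unrolled4(x, s):
    assert len(x) % 4 == 0, "slide's example assumes trip count divisible by 4"
    for i in range(0, len(x), 4):      # one loop test/increment per 4 elements
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s

a = [float(i) for i in range(1000)]
b = list(a)
saxpy(a, 2.5)
saxpy_unrolled4(b, 2.5)
print(a == b)  # True: same result with 1/4 of the loop overhead
```

The unrolled body also exposes four independent load/add/store chains, which is what lets the scheduler interleave them as shown above.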

Static Branch Prediction
Predict each branch as taken or not taken at compile time:
- Always predict taken
- Predict on the basis of branch direction (e.g., backward branches taken, forward branches not taken)
- Predict on the basis of profile information
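The difference between "always taken" and profile-based prediction can be sketched as follows. The branch names and profile counts here are made up purely for illustration:

```python
profile = {            # branch -> (times taken, times not taken)
    "loop_back": (950, 50),    # loop-closing branch: almost always taken
    "error_chk": (10, 990),    # error check: almost never taken
}

def mispredicts(profile, policy):
    """policy(taken, not_taken) -> True if the branch is predicted taken."""
    total = 0
    for taken, not_taken in profile.values():
        predict_taken = policy(taken, not_taken)
        total += not_taken if predict_taken else taken
    return total

always_taken  = mispredicts(profile, lambda t, nt: True)     # 50 + 990
profile_based = mispredicts(profile, lambda t, nt: t >= nt)  # 50 + 10
print(always_taken, profile_based)
```

Profile-based prediction wins precisely on branches like the error check, whose dominant direction is not-taken.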

Software Pipelining
Interleave instructions chosen from different loop iterations.
Example:

Example

Original loop:
Loop: L.D    F0, 0(R1)
      Add.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Successive iterations:
Iter1: L.D   F0, 0(R1)
       Add.D F4, F0, F2
       S.D   F4, 0(R1)
Iter2: L.D   F0, 0(R1)
       Add.D F4, F0, F2
       S.D   F4, 0(R1)
Iter3: L.D   F0, 0(R1)
       Add.D F4, F0, F2
       S.D   F4, 0(R1)

Software-pipelined loop body:
Loop: S.D    F4, 16(R1)    ; store for iteration i-2
      Add.D  F4, F0, F2    ; add for iteration i-1
      L.D    F0, 0(R1)     ; load for iteration i
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Example (cont'd)

Execution (with stalls):
Loop: S.D    F4, 16(R1)
      Add.D  F4, F0, F2
      L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNE    R1, R2, Loop
      stall

Scheduled:
Loop: S.D    F4, 16(R1)
      DADDUI R1, R1, #-8
      Add.D  F4, F0, F2
      BNE    R1, R2, Loop
      L.D    F0, 8(R1)
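A source-level sketch of the software-pipelined loop: in the steady state each pass stores the result of iteration i-2, adds for iteration i-1, and loads for iteration i. The prologue (start-up) and epilogue (wind-down) code shown here is what the slide's loop body omits. This sketch assumes the array has at least 2 elements:

```python
def saxpy_sw_pipelined(x, s):
    n = len(x)
    assert n >= 2, "sketch assumes at least 2 iterations"
    f0 = x[0]            # prologue: load for iteration 0
    f4 = f0 + s          #           add for iteration 0
    f0 = x[1]            #           load for iteration 1
    for i in range(2, n):
        x[i - 2] = f4    # S.D  : store result of iteration i-2
        f4 = f0 + s      # Add.D: add for iteration i-1
        f0 = x[i]        # L.D  : load for iteration i
    x[n - 2] = f4        # epilogue: drain the last two
    x[n - 1] = f0 + s    #           in-flight iterations

a = [float(i) for i in range(1000)]
saxpy_sw_pipelined(a, 2.5)
print(a == [i + 2.5 for i in range(1000)])  # True
```

Because the store, add, and load in the loop body now belong to different iterations, no instruction in the body depends on the result of the one just before it.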

Software Pipelining (cont'd)
Using instructions from different iterations separates data-dependent instructions within the loop body.
Compared with loop unrolling (or combined loop unrolling and software pipelining):
- Less code space
- But it can be quite difficult to apply
- May require significant code transformation
- The trade-offs are complex
- Register management can be difficult

To Further Exploit the ILP: The VLIW Approach
VLIW: Very Long Instruction Word
- Multiple, independent functional units
- Multiple operations per instruction
- The compiler organizes the instructions to avoid hazards

Example: a VLIW instruction with five slots:
- Two memory references
- Two FP operations
- One integer operation or branch
Unroll the loop x[i] = x[i] + s as necessary to fill the slots.

Mem Ref 1        | Mem Ref 2        | FP Op 1            | FP Op 2            | Int Op/Branch
L.D F0, 0(R1)    | L.D F6, -8(R1)   |                    |                    |
L.D F10, -16(R1) | L.D F14, -24(R1) |                    |                    |
L.D F18, -32(R1) | L.D F22, -40(R1) | Add.D F4, F0, F2   | Add.D F8, F6, F2   |
L.D F26, -48(R1) |                  | Add.D F12, F10, F2 | Add.D F16, F14, F2 |
                 |                  | Add.D F20, F18, F2 | Add.D F24, F22, F2 |
S.D F4, 0(R1)    | S.D F8, -8(R1)   | Add.D F28, F26, F2 |                    |
S.D F12, -16(R1) | S.D F16, -24(R1) |                    |                    | DADDUI R1, R1, #-56
S.D F20, 24(R1)  | S.D F24, 16(R1)  |                    |                    |
S.D F28, 8(R1)   |                  |                    |                    | BNE R1, R2, Loop
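The schedule above can be checked mechanically. The sketch below verifies two things for each bundle (row): no slot class is over-subscribed (at most 2 memory ops, 2 FP ops, 1 integer/branch op), and every operand is read only after its producer's latency has elapsed (load result usable 2 cycles later, FP result usable by a store 3 cycles later, integer result usable by the branch 2 cycles later). Register offsets are omitted; only slot pressure and dependence distances are modeled.

```python
SLOTS = {"mem": 2, "fp": 2, "intbr": 1}
MIN_GAP = {"L.D": 2, "Add.D": 3, "DADDUI": 2}   # cycles from producer to use

# One inner list per bundle/cycle; each op is (slot, opcode, dest, srcs).
bundles = [
    [("mem", "L.D", "F0", []),  ("mem", "L.D", "F6", [])],
    [("mem", "L.D", "F10", []), ("mem", "L.D", "F14", [])],
    [("mem", "L.D", "F18", []), ("mem", "L.D", "F22", []),
     ("fp", "Add.D", "F4", ["F0"]), ("fp", "Add.D", "F8", ["F6"])],
    [("mem", "L.D", "F26", []),
     ("fp", "Add.D", "F12", ["F10"]), ("fp", "Add.D", "F16", ["F14"])],
    [("fp", "Add.D", "F20", ["F18"]), ("fp", "Add.D", "F24", ["F22"])],
    [("mem", "S.D", None, ["F4"]), ("mem", "S.D", None, ["F8"]),
     ("fp", "Add.D", "F28", ["F26"])],
    [("mem", "S.D", None, ["F12"]), ("mem", "S.D", None, ["F16"]),
     ("intbr", "DADDUI", "R1", [])],
    [("mem", "S.D", None, ["F20"]), ("mem", "S.D", None, ["F24"])],
    [("mem", "S.D", None, ["F28"]), ("intbr", "BNE", None, ["R1"])],
]

def check(bundles):
    write_cycle = {}                 # register -> (cycle written, producer op)
    n_ops = 0
    for cycle, bundle in enumerate(bundles, start=1):
        used = {"mem": 0, "fp": 0, "intbr": 0}
        for slot, op, dest, srcs in bundle:
            used[slot] += 1
            n_ops += 1
            for reg in srcs:         # is the dependence distance respected?
                wcycle, wop = write_cycle[reg]
                assert cycle >= wcycle + MIN_GAP[wop], (op, reg)
            if dest:
                write_cycle[dest] = (cycle, op)
        for slot, cap in SLOTS.items():
            assert used[slot] <= cap, (cycle, slot)
    return len(bundles), n_ops

print(check(bundles))  # (9, 23): 23 ops in 9 cycles of 5 slots each
```

23 operations in 45 slots (about 2.5 ops/cycle) also illustrates the wasted-slot problem discussed next.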

Problems for the VLIW Approach
Technical problems:
- Increased code size: unrolling loops increases code size, and bits are wasted in instructions whose slots are not full
- A stall in one functional unit causes the other units to stall
Logistical problem:
- Binary code compatibility

VLIW or EPIC
VLIW (very long instruction word): the compiler packs a fixed number of instructions into a single VLIW instruction; the instructions within a VLIW instruction are issued and executed in parallel. Example: high-end signal processors (TMS320C6201).
EPIC (explicitly parallel instruction computing): an evolution of VLIW. Example: Intel's IA-64, exemplified by the Itanium processor.

EPIC Processors, Intel's IA-64 ISA, and Itanium
A joint R&D project by Hewlett-Packard and Intel (announced in June 1994) resulted in the explicitly parallel instruction computing (EPIC) design style:
- ILP is specified explicitly in the machine code; that is, the parallelism is encoded directly into the instructions, similarly to VLIW
- A fully predicated instruction set
- An inherently scalable instruction set (i.e., the ability to scale to many functional units)
- Many registers
- Speculative execution of load instructions

EPIC Design Challenges
- Develop architectures applicable to general-purpose computing: find substantial parallelism in difficult-to-parallelize scalar programs
- Provide compatibility across hardware generations
- Support emerging applications (e.g., multimedia)
- The compiler must find or create sufficient ILP
- Combine the best attributes of VLIW and superscalar RISC (incorporating the best concepts from all available sources)
- Scale architectures for modern single-chip implementation

IA-64 Architecture
Unique architecture features and enhancements:
- Explicit parallelism and templates
- Predication, speculation, memory support, and others
- Floating-point and multimedia architecture
IA-64 resources available to applications:
- Large, application-visible register set
- Rotating registers, register stack, register stack engine
- IA-32 and PA-RISC compatibility models

Today's Architecture Challenges
Performance barriers:
- Memory latency
- Branches
- Loop pipelining and call/return overhead
Headroom constraints:
- Hardware-based instruction scheduling: unable to efficiently schedule parallel execution
- Resource constrained: too few registers, so multiple execution units cannot be fully utilized
Scalability limitations:
- Memory addressing efficiency

Intel's IA-64 ISA
The Intel 64-bit Architecture (IA-64) register model:
- 128 64-bit general-purpose registers GR0-GR127 to hold values for integer and multimedia computations; each register has one additional NaT (Not a Thing) bit indicating whether the value stored is valid
- 128 82-bit floating-point registers FR0-FR127; registers FR0 and FR1 are read-only with values +0.0 and +1.0
- 64 1-bit predicate registers PR0-PR63; the first register, PR0, is read-only and always reads 1 (true)
- 8 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
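The NaT bit is what makes speculative loads safe: a load hoisted above a branch that would have faulted sets NaT instead of trapping, and a later check at the load's original position triggers recovery. The sketch below is illustrative Python only; the names ld_s and chk_s mirror IA-64's ld.s and chk.s instructions, but the modeling is not the real semantics.

```python
class Reg:
    """A register value plus its NaT (Not a Thing) validity bit."""
    def __init__(self, value=0, nat=False):
        self.value, self.nat = value, nat

def ld_s(memory, addr):
    """Speculative load: a faulting access sets NaT instead of trapping."""
    if addr not in memory:
        return Reg(nat=True)           # defer the exception
    return Reg(memory[addr])

def chk_s(reg, recovery):
    """Check at the original load's position: run recovery code on NaT."""
    return recovery() if reg.nat else reg.value

memory = {0x1000: 42}
ok  = ld_s(memory, 0x1000)
bad = ld_s(memory, 0xdead)             # would fault; NaT is set instead
print(chk_s(ok, lambda: -1), chk_s(bad, lambda: -1))  # 42 -1
```

If the speculation was unnecessary (the guarding branch is not taken), the NaT value is simply never checked and the deferred fault costs nothing.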

IA-64's Large Register File (figure): integer registers GR0-GR127 (64-bit plus NaT bit; GR0-GR31 static, GR32-GR127 stacked/rotating), floating-point registers FR0-FR127 (82-bit; 32 static, 96 rotating), predicate registers PR0-PR63 (1-bit; PR0-PR15 static, PR16-PR63 rotating), branch registers BR0-BR7 (64-bit).

Intel's IA-64 ISA (cont'd)
IA-64 instructions are 41 bits long and consist of an op-code, a predicate field (6 bits), two source register addresses (7 bits each), a destination register address (7 bits), and special fields (including integer and floating-point arithmetic). The 6-bit predicate field in each IA-64 instruction refers to the set of 64 predicate registers.
Six types of instructions:
- A: Integer ALU ==> I-unit or M-unit
- I: Non-ALU integer ==> I-unit
- M: Memory ==> M-unit
- B: Branch ==> B-unit
- F: Floating-point ==> F-unit
- L: Long immediate ==> I-unit
IA-64 instructions are packed by the compiler into bundles.
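The field widths listed above can be checked against the 41-bit budget with a pack/unpack sketch: 6 predicate bits plus three 7-bit register fields leaves 14 bits for the op-code and special fields. Real IA-64 encodings vary by instruction type; this is purely an illustration of the bit budget, with arbitrary field values.

```python
PRED_BITS, REG_BITS, OP_BITS = 6, 7, 14
assert PRED_BITS + 3 * REG_BITS + OP_BITS == 41   # the 41-bit budget

def encode(op, dest, src1, src2, pred):
    word = op                           # op-code / special fields (14 bits)
    word = (word << REG_BITS) | dest    # destination register (7 bits)
    word = (word << REG_BITS) | src1    # source register 1 (7 bits)
    word = (word << REG_BITS) | src2    # source register 2 (7 bits)
    word = (word << PRED_BITS) | pred   # qualifying predicate (6 bits)
    return word

def decode(word):
    pred = word & ((1 << PRED_BITS) - 1); word >>= PRED_BITS
    src2 = word & ((1 << REG_BITS) - 1);  word >>= REG_BITS
    src1 = word & ((1 << REG_BITS) - 1);  word >>= REG_BITS
    dest = word & ((1 << REG_BITS) - 1);  word >>= REG_BITS
    return word, dest, src1, src2, pred

fields = (0x2AB, 127, 3, 64, 5)         # arbitrary in-range values
word = encode(*fields)
print(word < 2**41 and decode(word) == fields)  # True
```

Three such 41-bit instructions plus a 5-bit template field fill one 128-bit bundle, which is the unit the compiler emits.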

Memory Support for High-Performance Technical Computing
- Scientific analysis, 3D graphics, and other technical workloads tend to be predictable and memory bound
- IA-64 data pre-fetch operations allow fast access to critical information, reducing the impact of memory latency
- IA-64 can specify cache allocation: cache hints on load/store operations allow data to be placed at a specific cache level, for efficient use of caches and bandwidth

IA Server/Workstation Roadmap (figure): the IA-64 line starts with the Merced processor (Itanium), followed by McKinley and Madison (performance) and Deerfield (price/performance); the IA-32 line continues with the Pentium II Xeon, Pentium III Xeon, Foster, and future processors. The timeline spans 1998-2003 across .25µ, .18µ, and .13µ processes. All dates specified are target dates provided for planning purposes only and are subject to change.

Itanium
- 64-bit processor ==> not in the Pentium, Pentium Pro, Pentium II/III line
- Targeted at servers with moderate to large numbers of processors
- Full compatibility with Intel's IA-32 ISA
- EPIC (explicitly parallel instruction computing) is applied
- 6-wide (two 3-instruction EPIC bundles) pipeline
- 10-stage pipeline
- Functional units: 4 integer, 4 multimedia, 2 load/store, 3 branch, 2 extended-precision floating-point, 2 single-precision floating-point
- Multi-level branch prediction in addition to predication
- 16 KB 4-way set-associative L1 data and instruction caches
- 96 KB 6-way set-associative L2 cache
- 4 MB L3 cache (on package)
- 800 MHz, 0.18-micron process (at the beginning of 2001); shipments end of 1999, or mid-2000, or...?

Conceptual View of Itanium

Itanium Processor Core Pipeline (selected stages):
- ROT: instruction rotation
- WLD: word-line decode (part of the pipelined access of the large register file)
- REG: register read
- DET: exception detection (~retire stage)

Itanium Die Plot

Itanium vs. Willamette (P4)
- Itanium announced at 800 MHz; the P4 announced at 1.2 GHz
- The P4 may be faster running IA-32 code than the Itanium running IA-64 code
- Itanium probably won't compete with contemporary IA-32 processors, but Intel will complete the Itanium design anyway
- Intel's hopes rest on the Itanium successor, McKinley, due out only one year later