Chapter 3 (Cont. III): Exploiting ILP with Software Approaches
Copyright Josep Torrellas 1999, 2001, 2002, 2013

Exposing ILP (3.2)
We want to find sequences of unrelated instructions that can be overlapped in the pipeline. To avoid stalls, a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the latency of that source instruction.
Assumptions: basic 5-stage pipeline, 1-cycle branch delay, and functional units that are fully pipelined or replicated.
Note: in Figure 3.2, "latency" means the number of intervening instructions.
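The slide does not show the loop itself, but the running example in this part of the textbook is a loop that adds a scalar to every element of an array. A minimal C sketch (the function and variable names are choices made here, not from the slides):

```c
#include <stddef.h>

/* Add a scalar s to every element of x.  Each iteration is independent:
 * iteration i reads and writes only x[i], so a scheduler is free to
 * separate the dependent load -> add -> store chain of one element by
 * interleaving it with the chains of other elements. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}
```

Within one iteration the add depends on the load, and the store depends on the add; it is across iterations that the independent work needed to fill the latency gaps comes from.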

Loop Unrolling (example on pages 158-160)
Unrolling: replicate the loop body several times, adjusting the loop termination code. Use different registers in every iteration, which increases register pressure.
In real life we do not know the number of iterations. Assume n iterations and k bodies per unrolled iteration, and separate the code into two loops:
- one executes (n mod k) times and has the original body
- the other executes (n/k) times and has k bodies
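The (n mod k) / (n/k) split above can be sketched in C as follows; the unroll factor k = 4 and the names are arbitrary choices for illustration:

```c
#include <stddef.h>

/* Unroll-by-4 version of x[i] += s.  The prologue runs n mod 4 times
 * with the original body; the main loop runs n/4 times with four copies
 * of the body, paying the loop overhead (increment, compare, branch)
 * once per four elements instead of once per element. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    size_t i = 0;

    /* Prologue: n mod 4 iterations of the original body. */
    for (; i < n % 4; i++)
        x[i] = x[i] + s;

    /* Main loop: n/4 iterations, 4 bodies each.  The four load/add/store
     * chains are independent, so a scheduler can interleave them. */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

In assembly the compiler would additionally rename each copy of the body onto different registers, which is where the register-pressure cost comes from.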

Loop Unrolling Summary
Advantages: more ILP; fewer overhead instructions.
Disadvantages: code size increases; register pressure increases; the problem becomes worse in multiple-issue processors.
Example: unrolling took the loop from 9 cycles per element down to 3.5 cycles per element.

The VLIW Approach
Superscalar hardware is tough to build. E.g., for a two-issue superscalar, every cycle we must examine the opcodes of 2 instructions and their 6 register specifiers, dynamically determine whether one or two instructions can issue, and dispatch them to the appropriate FUs.
VLIW packages multiple independent operations into one very long instruction:
- the burden of finding independent instructions is on the compiler
- there are no dependences within an issue package, or at least the compiler indicates when a dependence is present
- the hardware is simpler
Example VLIW instruction: 2 integer, 2 FP, 2 memory, and 1 branch operation. An instruction has a set of fields for each FU, giving 112-168 bits per instruction.
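A rough model of such an instruction format as a C struct (the field names and the padding of each slot to 32 bits are assumptions for illustration, not an actual encoding):

```c
#include <stdint.h>

/* Hypothetical 7-slot VLIW bundle: 2 integer, 2 FP, 2 memory, 1 branch.
 * In a real encoding each slot would be roughly 16-24 bits (opcode plus
 * up to three 5-bit register specifiers), so 7 slots give the 112-168
 * bits quoted in the slides; here each slot is padded to a uint32_t
 * for simplicity.  A value of 0 stands for a no-op in that slot. */
typedef uint32_t vliw_op;

typedef struct {
    vliw_op int_op[2];     /* integer ALU slots   */
    vliw_op fp_op[2];      /* floating-point slots */
    vliw_op mem_op[2];     /* load/store slots    */
    vliw_op branch_op;     /* branch slot         */
} vliw_bundle;
```

All operations in one bundle issue together; the compiler guarantees they are independent, so the hardware needs no cross-checking of register specifiers at issue time.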

The VLIW Approach (cont.)
Early VLIWs were rigid in their instruction formats and required recompilation of programs for different versions of the hardware. Newer VLIWs still require the compiler to do most of the work of finding and scheduling instructions, but add innovations to increase flexibility.

The VLIW Approach (3.7)
To keep all FUs busy, we need enough parallelism in the code. How do we uncover parallelism?
- Unrolling loops eliminates branches
- Scheduling code within a basic block: local scheduling
- Scheduling code across basic blocks: global scheduling
One global scheduling technique is trace scheduling.

Example: 2 memory + 2 FP + 1 integer-or-branch operation per instruction; ignore the branch delay slot; unroll the typical loop enough to eliminate all stalls (issuing 1 VLIW instruction per cycle).
Figure 3.16: seven copies of the body take 9 cycles, or 1.29 cycles per result, which is about twice as fast as the superscalar version.
Note: efficiency is about 60% (the fraction of slots holding an operation). This requires a LARGE number of registers!

Problems in the Original VLIW Model
Increase in code size:
- generating enough parallelism requires wide loop unrolling
- many instructions are not full; if no operation can be scheduled, the whole instruction is empty
Limitations of lockstep operation:
- any stall in any FU pipeline must stall the entire processor, since all FUs are kept synchronized
- this is manageable for deterministic FU latencies, but not for cache misses: a cache miss stalled all FUs

Problems in the Original VLIW Model (cont.)
Binary code compatibility:
- code generation needs to know the detailed pipeline structure, including the FUs and their latencies
- migrating between successive implementations, or between implementations with different issue widths, requires recompilation (unlike dynamic superscalars)
Finding enough parallelism (a problem for all multiple-issue processors).
EPIC (IA-64) solves some of these problems.

Combating the Code Size Increase
Use clever encoding:
- have one large immediate field shared by any FU
- compress the instructions in main memory and expand them when they are read into the cache or decoded
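One way to realize the compress/expand idea is to store only the occupied slots plus a small presence mask, and rebuild the full bundle on the way into the cache. This is purely illustrative: the mask-based scheme, the function names, and the 32-bit slot size are assumptions, not the encoding of any particular machine.

```c
#include <stdint.h>
#include <string.h>

#define SLOTS 7    /* 2 int + 2 fp + 2 mem + 1 branch */

/* Compress a bundle: emit a 1-byte mask of occupied slots, then only
 * the non-nop operations (0 encodes a no-op).  Returns the number of
 * bytes written to out. */
size_t compress_bundle(const uint32_t bundle[SLOTS], uint8_t *out)
{
    uint8_t mask = 0;
    size_t pos = 1;                          /* byte 0 is the mask */
    for (int s = 0; s < SLOTS; s++) {
        if (bundle[s] != 0) {
            mask |= (uint8_t)(1u << s);
            memcpy(out + pos, &bundle[s], sizeof(uint32_t));
            pos += sizeof(uint32_t);
        }
    }
    out[0] = mask;
    return pos;
}

/* Expand back to the full 7-slot bundle, restoring the no-ops. */
void expand_bundle(const uint8_t *in, uint32_t bundle[SLOTS])
{
    uint8_t mask = in[0];
    size_t pos = 1;
    for (int s = 0; s < SLOTS; s++) {
        if (mask & (1u << s)) {
            memcpy(&bundle[s], in + pos, sizeof(uint32_t));
            pos += sizeof(uint32_t);
        } else {
            bundle[s] = 0;
        }
    }
}
```

A mostly empty bundle (say, one operation out of seven) shrinks from 28 bytes to 5 under this sketch, which is exactly the case that hurts VLIW code size the most.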

Lockstep Operation
Lockstep operation is not acceptable. In recent VLIW processors the FUs operate more independently: hardware checks for hazards and allows unsynchronized execution.

Solving the Migration Problem
Use object code translation or emulation.

Finding Enough Parallelism
If FP programs need heavy loop unrolling, vector processors can do just as well. However, multiple-issue processors have advantages over vector processors:
- the ability to extract some parallelism from less structured code
- the ability to use a conventional, cheap, cache-based memory system
Consequently: use multiple-issue processors and add a vector extension to them.