ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 552 / CPS 550 Advanced Computer Architecture I, Lecture 15: Very Long Instruction Word Machines. Benjamin Lee, Electrical and Computer Engineering, Duke University. www.duke.edu/~bcl15, www.duke.edu/~bcl15/class/class_ece252fall11.html

ECE252 Administrivia, 13 November: Homework #4 due. Project proposals: you should receive comments by the end of the week.

OoO Superscalar Complexity

Out-of-order Superscalar
- Objective: Increase instruction-level parallelism.
- Cost: Hardware logic/mechanisms to track dependencies and dynamically schedule independent instructions.

Hardware Complexity
- Instructions can issue and complete out-of-order.
- Instructions must commit in-order.
- Implement Tomasulo's algorithm with a variety of structures.
- Example: reservation stations, reorder buffer, physical register file.

Very Long Instruction Word (VLIW)
- Objective: Increase instruction-level parallelism.
- Cost: Software compilers/mechanisms to track dependencies and statically schedule independent instructions.

Review of Fetch/Decode

Fetch
- Load the instruction by accessing memory at the program counter (PC).
- Update the PC using the sequential address (PC+4) or branch prediction (BTB).

Decode/Rename
- Take the instruction from the fetch buffer.
- Allocate the resources necessary to execute the instruction (a sketch of this allocate-or-stall check follows this slide):
  (1) Destination physical register, if the instruction writes a register (rename)
  (2) Reorder buffer (ROB) entry, to support in-order commit
  (3) Issue queue entry, to hold the instruction while it waits for execution
  (4) Memory buffer entry, to resolve dependencies through memory (next slide)
- Stall if resources are unavailable.
- Rename source/destination registers.
- Update the reorder buffer, issue queue, and memory buffer.
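
To make the allocate-or-stall step above concrete, here is a minimal C sketch of decode/rename resource allocation, assuming each structure is tracked by a simple free-entry count; the type and field names (Resources, free_pregs, rob_free, iq_free, memq_free) are hypothetical, not from the lecture:

    #include <stdbool.h>

    /* Hypothetical counts of free entries in each structure. */
    typedef struct {
        int free_pregs;   /* free physical registers     */
        int rob_free;     /* free reorder buffer entries */
        int iq_free;      /* free issue queue entries    */
        int memq_free;    /* free memory buffer entries  */
    } Resources;

    /* Returns true if the instruction can be renamed this cycle;
     * otherwise the front end must stall. */
    bool try_rename(Resources *r, bool writes_reg, bool is_mem_op) {
        /* Stall if any required resource is unavailable. */
        if (writes_reg && r->free_pregs == 0) return false;
        if (r->rob_free == 0 || r->iq_free == 0) return false;
        if (is_mem_op && r->memq_free == 0) return false;

        /* Allocate: destination physical register, ROB entry, issue queue
         * entry, and (for loads/stores) a memory buffer entry. */
        if (writes_reg) r->free_pregs--;
        r->rob_free--;
        r->iq_free--;
        if (is_mem_op) r->memq_free--;
        return true;
    }

If try_rename returns false, the front end stalls and retries the same instruction in the next cycle.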

Review of Memory Buffer

Allocate a memory buffer entry for each memory instruction.

Store Instructions
- Calculate the store-address and place it in the buffer.
- Take the store-data and place it in the buffer.
- The instruction commits in-order when the store-address and store-data are ready.

Load Instructions
- Calculate the load-address and place it in the buffer.
- The instruction searches the memory buffer for stores with a matching address.
- Forward load data from in-flight stores with a matching address.
- Stall the load if the buffer contains stores with unresolved addresses.
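
The following is a minimal C sketch of the load lookup described above: scan older stores, stall if any older store address is unresolved, and otherwise forward data from the youngest matching store. The MemEntry layout, the oldest-first array ordering, and the conservative stall policy are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     is_store;
        bool     addr_ready;   /* store-address computed?  */
        bool     data_ready;   /* store-data available?    */
        uint64_t addr;
        uint64_t data;
    } MemEntry;

    typedef enum { LOAD_FORWARD, LOAD_STALL, LOAD_GO_TO_CACHE } LoadResult;

    /* buf[0..n_older-1] holds entries older than the load, oldest first. */
    LoadResult load_lookup(const MemEntry *buf, int n_older,
                           uint64_t load_addr, uint64_t *out_data) {
        LoadResult res = LOAD_GO_TO_CACHE;
        for (int i = 0; i < n_older; i++) {
            if (!buf[i].is_store) continue;
            if (!buf[i].addr_ready)
                return LOAD_STALL;          /* unresolved older store address */
            if (buf[i].addr == load_addr) {
                if (!buf[i].data_ready)
                    return LOAD_STALL;      /* matching store, data not ready */
                *out_data = buf[i].data;    /* youngest match so far wins     */
                res = LOAD_FORWARD;
            }
        }
        return res;
    }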

Review of Issue/Execute

Issue
- An instruction commits from the reorder buffer.
- A commit wakes up an instruction by marking its sources ready.
- Select logic determines which ready instructions should execute.
- Issue by sending instructions to a functional unit.

Execute
- Read operands from the physical register file and/or the forwarding path.
- Execute the instruction in a functional unit.
- Write the result to the physical register file and store buffer.
- Produce exception status.
- Write to the reorder buffer.
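
A compact C sketch of the wakeup and select steps described above, assuming a small issue queue scanned in program order; the IQEntry fields and the oldest-first select policy are illustrative choices, not the lecture's design:

    #include <stdbool.h>

    #define IQ_SIZE 32

    typedef struct {
        bool valid;
        int  src1, src2;            /* source physical register numbers */
        bool src1_ready, src2_ready;
    } IQEntry;

    /* Wakeup: a newly produced physical register marks matching sources ready. */
    void wakeup(IQEntry iq[IQ_SIZE], int produced_preg) {
        for (int i = 0; i < IQ_SIZE; i++) {
            if (!iq[i].valid) continue;
            if (iq[i].src1 == produced_preg) iq[i].src1_ready = true;
            if (iq[i].src2 == produced_preg) iq[i].src2_ready = true;
        }
    }

    /* Select: return the index of one ready entry to issue (oldest-first by
     * queue position here), or -1 if nothing is ready this cycle. */
    int select_ready(const IQEntry iq[IQ_SIZE]) {
        for (int i = 0; i < IQ_SIZE; i++)
            if (iq[i].valid && iq[i].src1_ready && iq[i].src2_ready)
                return i;
        return -1;
    }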

Review of Commit

Commit
- Instructions can complete and update the reorder buffer out-of-order.
- Instructions commit from the reorder buffer in-order.

Exceptions
- Check for exceptions.
- If an exception is raised, flush the pipeline.
- Jump to the exception handler.

Release Resources
- Free the physical register used by the last writer to the same architected register.
- Free the reorder buffer slot.
- Free the memory buffer slot.

Control Logic Scaling

[Figure: an issue group of W instructions is checked against previously issued instructions still in flight over a lifetime of L cycles.]

- Lifetime (L): the number of cycles an instruction spends in the pipeline; it depends on pipeline latency and time spent in the reorder buffer.
- Issue width (W): the maximum number of instructions issued per cycle.
- As W increases, issue logic must find more instructions to execute in parallel to keep the pipeline busy, so more instructions must be fetched, decoded, and queued.
- Any of the W x L in-flight instructions can impact any of the W issuing instructions (e.g., forwarding), so control hardware grows in proportion to W x (W x L). For example, with W = 4 and L = 8, each of the 4 issuing instructions must be checked against up to 32 in-flight instructions, roughly 128 cross-checks per cycle.

Control Logic (MIPS R10000)

[Figure: control logic in the MIPS R10000.]

Sequential Instruction Sets

[Figure: sequential source code (e.g., a = foo(b); for (i=0, i< ...) is translated by a superscalar compiler, which finds independent operations and schedules operations, into sequential machine code; the superscalar processor then checks instruction dependencies and schedules execution.]

Sequential Instruction Sets

Superscalar Compiler
- Takes sequential source code (e.g., C, C++).
- Checks instruction dependencies.
- Schedules operations to preserve dependencies.
- Produces sequential machine code (e.g., MIPS).

Superscalar Processor
- Takes sequential machine code (e.g., MIPS).
- Checks instruction dependencies.
- Schedules operations to preserve dependencies.

Inefficiency of Superscalar Processors
- Dependency checking and scheduling are performed dynamically in hardware.
- Expensive logic rediscovers schedules that a compiler could have found statically.

VLIW: Very Long Instruction Word

Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency

- Multiple operations are packed into one instruction format.
- The instruction format is fixed; each slot supports a particular instruction type.
- Constant operation latencies are specified (e.g., 1-cycle integer op).
- Software schedules operations into the instruction format, guaranteeing:
  (1) Parallelism within an instruction: no RAW checks between ops
  (2) No data use before ready: no data interlocks/stalls
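
As a concrete, purely illustrative picture of the fixed format, here is a C sketch of one six-slot VLIW instruction for the machine above; a real encoding would pack these fields into fixed bit positions within a long instruction word:

    #include <stdint.h>

    /* One operation slot; OP_NOP fills slots the compiler could not use. */
    typedef enum { OP_NOP, OP_ADD, OP_LD, OP_SD, OP_FADD, OP_FMUL } Opcode;

    typedef struct {
        Opcode  op;
        uint8_t dest, src1, src2;   /* register specifiers      */
        int16_t imm;                /* immediate / displacement */
    } Slot;

    /* A single very long instruction: all six operations issue together,
     * and the compiler has already guaranteed they are independent. */
    typedef struct {
        Slot int_op[2];   /* two integer units,        1-cycle latency */
        Slot mem_op[2];   /* two load/store units,     3-cycle latency */
        Slot fp_op[2];    /* two floating-point units, 4-cycle latency */
    } VLIWInstruction;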

VLIW Compiler Responsibilities

- Schedule operations to maximize parallel execution: fill the operation slots.
- Guarantee intra-instruction parallelism: ensure operations within the same instruction are independent.
- Schedule to avoid data hazards: separate dependent operations with explicit NOPs.

Loop Execution

for (i=0; i<n; i++) B[i] = A[i] + C;

Compiled code:
loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):
cycle 1:    Int1: add r1   M1: ld f1
cycles 2-3: (wait for load)
cycle 4:    FP+: fadd f2
cycles 5-7: (wait for fadd)
cycle 8:    Int1: add r2   Int2: bne   M1: sd f2

- The latency of each instruction is fixed (e.g., 3-cycle ld, 4-cycle fadd).
- Instr-1: Load A[i] and increment i (r1) in parallel.
- Instr-2: Wait for the load, then fadd.
- Instr-3: Wait for the fadd, then store B[i], increment i (r2), and branch in parallel.
- How many flops/cycle? 1 fadd / 8 cycles = 0.125

Loop Unrolling

for (i=0; i<n; i++) B[i] = A[i] + C;

- Unroll the inner loop to perform k iterations of the computation at once.
- If N is not a multiple of the unrolling factor k, insert clean-up code (see the sketch below).
- Example: unroll the inner loop to perform 4 iterations at once:

for (i=0; i<n; i+=4) {
  B[i]   = A[i]   + C;
  B[i+1] = A[i+1] + C;
  B[i+2] = A[i+2] + C;
  B[i+3] = A[i+3] + C;
}
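
A minimal C sketch of the unrolled loop plus the clean-up code mentioned above, assuming an unrolling factor k = 4; the function name and parameter types are made up for illustration:

    /* Clean-up code sketch: n need not be a multiple of the unrolling factor. */
    void add_const(double *B, const double *A, double C, int n) {
        int i;
        /* Unrolled main loop: covers the first (n / 4) * 4 iterations. */
        for (i = 0; i + 3 < n; i += 4) {
            B[i]   = A[i]   + C;
            B[i+1] = A[i+1] + C;
            B[i+2] = A[i+2] + C;
            B[i+3] = A[i+3] + C;
        }
        /* Clean-up loop: covers the remaining n % 4 iterations. */
        for (; i < n; i++)
            B[i] = A[i] + C;
    }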

Scheduling Unrolled Loops

Unrolled code:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):
cycle 1:  Int1: add r1   M1: ld f1   M2: ld f2
cycle 2:                 M1: ld f3   M2: ld f4
cycle 3:  (wait for loads)
cycle 4:  FP+: fadd f5
cycle 5:  FP+: fadd f6
cycle 6:  FP+: fadd f7
cycle 7:  FP+: fadd f8
cycle 8:  M1: sd f5
cycle 9:  M1: sd f6
cycle 10: M1: sd f7
cycle 11: Int1: add r2   Int2: bne   M1: sd f8

- Unrolling the loop to execute 4 iterations per pass reduces the number of empty operation slots.
- How many flops/cycle? 4 fadds / 11 cycles = 0.36

Software Pipelining

Exploit independent loop iterations
- If loop iterations are independent, then get more parallelism by scheduling instructions from different iterations.
- Example: loop iterations are independent in the code sequence below.

for (i=0; i<n; i++) B[i] = A[i] + C;

- Construct the data-flow graph for one iteration: load A[i] and the constant C feed the add (+), whose result feeds store B[i].

Software Pipelining (Illustrated)

Not pipelined: each iteration runs to completion before the next begins.
  load A[0], add C, store B[0]; load A[1], add C, store B[1]; load A[2], add C, store B[2]; ...

Pipelined: overlap the load, add, and store of different iterations.
  fill:          load A[0]
                 load A[1]   add(A[0]+C)
  steady state:  load A[2]   add(A[1]+C)   store B[0]
                 load A[3]   add(A[2]+C)   store B[1]
                 load A[4]   add(A[3]+C)   store B[2]
                 load A[5]   add(A[4]+C)   store B[3]
  drain:                     add(A[5]+C)   store B[4]
                                           store B[5]

Scheduling SW Pipelined Loops

Unroll the loop to perform 4 iterations at once. Then software pipeline.

for (i=0; i<n; i++) B[i] = A[i] + C;

Unrolled code:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

Schedule (slots: Int1, Int2, M1, M2, FP+, FPx):
- Fill: issue the loads (ld f1..f4) and add r1 of the first unrolled iterations before any fadds or stores.
- Steady state: in each pass of the software-pipelined loop, the loads of a later iteration, the fadds of the previous iteration, and the stores of the iteration before that issue together (ld f1..f4, fadd f5..f8, sd f5..f8 overlapped with add r1, add r2, bne), keeping the memory and FP slots busy.
- Drain: after the last loads, finish the remaining fadds and stores.
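
To make the fill / steady-state / drain structure concrete at the source level, here is a small C sketch that software-pipelines B[i] = A[i] + C with one store, one add, and one load per kernel iteration; the staging into temporaries is illustrative and is not the slide's VLIW schedule:

    /* Software-pipelining sketch: in the kernel, each pass stores iteration i,
     * adds for iteration i+1, and loads iteration i+2 (drain/steady/fill). */
    void add_const_pipelined(double *B, const double *A, double C, int n) {
        if (n < 2) {                      /* too short to pipeline: fall back */
            for (int i = 0; i < n; i++) B[i] = A[i] + C;
            return;
        }
        double a0 = A[0];                 /* fill (prologue): first loads and add */
        double s0 = a0 + C;
        double a1 = A[1];
        for (int i = 0; i + 2 < n; i++) { /* steady state (kernel) */
            B[i] = s0;                    /* store for iteration i    */
            s0   = a1 + C;                /* add for iteration i+1    */
            a1   = A[i + 2];              /* load for iteration i+2   */
        }
        B[n - 2] = s0;                    /* drain (epilogue): last add and stores */
        B[n - 1] = a1 + C;
    }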

What if there are no loops?

- A basic block is a sequence of consecutive instructions; every basic block ends with a branch.
- Instruction-level parallelism is hard to find within basic blocks.
- Basic blocks are illustrated by a control-flow graph.

Trace Scheduling

- A trace is a sequence of basic blocks (i.e., a long string of straight-line code).
- Trace Selection: Use profiling or compiler heuristics to find common sequences/paths.
- Trace Compaction: Schedule the whole trace into a few VLIW instructions.
- Add fix-up code to cope with branches jumping out of the trace; undo instructions if control flow diverges from the trace.

Problems with Classic VLIW

Object Code Challenges
- Compatibility: Need to recompile code for every VLIW machine, even across generations.
- Compatibility: Code is specific to the operation slots in the instruction format and to the latencies of operations.
- Code Size: Instruction padding wastes instruction memory/cache with nops for unfilled slots.
- Code Size: Loop unrolling and software pipelining increase the code footprint.

Scheduling Variable Latency Operations
- Effective schedules rely on known instruction latencies.
- Caches and memories produce unpredictable latency variability.

Intel Itanium, EPIC, IA-64

Explicitly Parallel Instruction Computing (EPIC), IA-64
- EPIC is a computer architecture style (e.g., CISC, RISC, EPIC).
- IA-64 is an instruction set architecture (e.g., x86, MIPS, IA-64).
- IA-64 stands for Intel Architecture, 64-bit.

Implementations
- Merced, first implementation, 2001
- McKinley, second implementation, 2002
- Poulson, recent implementation, 2011

Intel Itanium, EPIC IA-64

- Eight cores
- 1-cycle, 16KB L1 instruction and data caches
- 9-cycle, 512KB L2 I-cache
- 8-cycle, 256KB L2 D-cache
- 32MB shared L3 cache
- 544 mm^2 in 32nm CMOS, 3 billion transistors
- Cores are 2-way multithreaded
- Each VLIW word is 128 bits, containing 3 instructions (op slots)
- Fetch 2 words per cycle: 6 instructions (op slots)
- Retire 4 words per cycle: 12 instructions (op slots)

VLIW and Control Flow Challenges

Challenge
- Mispredicted branches limit ILP.
- Trace selection groups basic blocks into larger ones.
- Trace compaction schedules instructions into a single VLIW instruction.
- Requires fix-up code for branches that exit the trace.

Solution: Predicated Execution
- Eliminate hard-to-predict branches with predicated execution.
- IA-64 instructions can be executed conditionally under a predicate.
- An instruction becomes a NOP if its predicate register is false.

VLIW and Control Flow Challenges

Four basic blocks (if / then / else structure):
b0: Inst 1
    Inst 2
    br a==b, b2
b1: Inst 3
    Inst 4
    br b3
b2: Inst 5
    Inst 6
b3: Inst 7
    Inst 8

Predication (one basic block):
     Inst 1
     Inst 2
     p1, p2 <- cmp(a==b)
(p1) Inst 3
(p2) Inst 5
(p1) Inst 4
(p2) Inst 6
     Inst 7
     Inst 8

On average, >50% of branches are removed. Mahlke et al., 1995.
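
A small C sketch of the if-conversion idea above: the branchy version uses control flow, while the if-converted version computes both arms and selects by a predicate, so no hard-to-predict branch remains. The function names and the select-based formulation are illustrative; real predication (e.g., IA-64 predicate registers) happens at the instruction level, not in C source:

    #include <stdint.h>

    /* Branchy version: control flow depends on the comparison. */
    int32_t abs_diff_branchy(int32_t a, int32_t b) {
        if (a > b)
            return a - b;        /* "then" block */
        else
            return b - a;        /* "else" block */
    }

    /* If-converted version: both arms are always computed and the result is
     * selected by the predicate, mimicking (p1)/(p2) predicated operations. */
    int32_t abs_diff_predicated(int32_t a, int32_t b) {
        int32_t p = (a > b);     /* predicate: 1 or 0                     */
        int32_t t = a - b;       /* "then" result, always computed        */
        int32_t e = b - a;       /* "else" result, always computed        */
        return p ? t : e;        /* select; often lowered to a cond. move */
    }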

Limits of Static ILP

Software Instruction-Level Parallelism (VLIW)
- Compiler complexity
- Code size explosion
- Unpredictable branches
- Variable memory latency and unpredictable cache behavior

Current Status
- Despite several attempts, VLIW has failed in general-purpose computing.
- VLIW hardware complexity is similar to in-order superscalar hardware complexity, with limited advantage on large, complex applications.
- VLIW has been successful in embedded digital signal processing, which has compiler-friendly code.

Summary

Out-of-order Superscalar
- Hardware complexity increases super-linearly with issue width.

Very Long Instruction Word (VLIW)
- The compiler explicitly schedules parallel instructions.
- Unrolling and software pipelining expose parallelism in loops.

Predication
- Mitigates branches in VLIW machines.
- Predicated operations: if the predicate is false, the operation does not affect architected state.

Mahlke et al. "A comparison of full and partial predicated execution support for ILP processors." 1995.

Acknowledgements

These slides contain material developed and copyrighted by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- Arvind Krishnamurthy (U. Washington)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)