ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

Similar documents
Complex Pipelining. Motivation

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

Lecture 4 Pipelining Part II

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 6 Pipelining Part 1

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

EC 513 Computer Architecture

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 4 Pipelining Part II

Instruction-Level Parallelism and Its Exploitation

Processor: Superscalars Dynamic Scheduling

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 4 Reduced Instruction Set Computers

Detailed Scoreboard Pipeline Control. Three Parts of the Scoreboard. Scoreboard Example Cycle 1. Scoreboard Example. Scoreboard Example Cycle 3

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

Advantages of Dynamic Scheduling

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism

COSC 6385 Computer Architecture - Pipelining (II)

Slide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2

CS 152, Spring 2011 Section 8

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD

COSC4201 Instruction Level Parallelism Dynamic Scheduling

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 13 Memory Part 2

Scoreboard information (3 tables) Four stages of scoreboard control

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Good luck and have fun!

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Metodologie di Progettazione Hardware-Software

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Lecture-13 (ROB and Multi-threading) CS422-Spring

CS425 Computer Systems Architecture

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

Lecture 4: Introduction to Advanced Pipelining

CS152 Computer Architecture and Engineering. Complex Pipelines

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

CS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers

Advanced issues in pipelining

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Tomasulo s Algorithm

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP

Lecture 4 - Pipelining

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Instruction Level Parallelism

Lecture 9: Multiple Issue (Superscalar and VLIW)

Static vs. Dynamic Scheduling

C 1. Last Time. CSE 490/590 Computer Architecture. ISAs and MIPS. Instruction Set Architecture (ISA) ISA to Microarchitecture Mapping

Instruction Pipelining Review

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles

Lecture 14: Multithreading

Advanced Computer Architecture

Adapted from David Patterson s slides on graduate computer architecture

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

EECC551 Exam Review 4 questions out of 6 questions

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

Computer Architecture ELEC3441

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Hardware-based Speculation

Instruction Level Parallelism (ILP)

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Four Steps of Speculative Tomasulo cycle 0

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

CSC 631: High-Performance Computer Architecture

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Multiple Instruction Issue. Superscalars

E0-243: Computer Architecture

Chapter 4 The Processor 1. Chapter 4D. The Processor

CS 152 Computer Architecture and Engineering

Course on Advanced Computer Architectures

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

Multi-cycle Instructions in the Pipeline (Floating Point)

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

MIPS ISA AND PIPELINING OVERVIEW Appendix A and C

Handout 2 ILP: Part B

Transcription:

ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 8 Instruction-Level Parallelism Part 1 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

ECE252 Administrivia 29 September Homework #2 Due - Use blackboard forum for questions - Attend office hours with questions - Email for separate meetings 4 October Class Discussion Roughly one reading per class. Do not wait until the day before! 1. Srinivasan et al. Optimizing pipelines for power and performance 2. Mahlke et al. A comparison of full and partial predicated execution support for ILP processors 3. Palacharla et al. Complexity-effective superscalar processors 4. Yeh et al. Two-level adaptive training branch prediction ECE 252 / CPS 220 2

Complex Pipelining Pipelining becomes complex when we want high performance in the presence of - Long latency or partially pipelined floating-point units - Memory systems with variable access time - Multiple arithmetic and memory units MIPS Floating Point - Interaction between floating-point (FP), integer datapath defined by ISA - Architect separate register files for floating point (FPR) and integer (GPR) - Define separate load/store instructions for FPR, GPR - Define move instructions between register files - Define FP branches in terms of FP-specific condition codes ECE 252 / CPS 220 3

Floating-Point Unit (FPU) FPU requires much more hardware than integer unit Single-cycle FPU a bad idea - Why? - It is common to have several, different types of FPUs (Fadd, Fmul, etc.) - FPU may be pipelined, partially pipelined, or not pipelined Floating-point Register File (FPR) - To operate several FPUs concurrently, FPR requires several read/write ports ECE 252 / CPS 220 4

Pipelining FPUs fully pipelined 1cyc 1cyc 1cyc partially pipelined 2 cyc 2 cyc Functional units have internal pipeline registers - Inputs to a functional unit (e.g., register file) can change during a long latency operation - Operands are latched when an instruction enters the functional unit ECE 252 / CPS 220 5

Realistic Memory Systems Latency of main memory access usually greater than one cycle and often unpredictable - Solving this problem is a central issue in computer architecture Improving memory performance - Separate instruction and data memory ports, no self-modifying code - Caches -- size L1 cache for single-cycle access - Caches -- L1 miss stalls pipeline - Memory interleaving memory allows multiple simultaneous access - Memory bank conflicts stall the pipeline ECE 252 / CPS 220 6

Multiple Functional Units ALU Mem IF ID Issue WB GPR s FPR s Fadd Fmul Fdiv ECE 252 / CPS 220 7

Complex Pipeline Control Implications of multi-cycle instructions - FPU or memory unit requires more than one cycle - Structural conflict in execution stage, if FPU or memory unit is not pipelined Different functional unit latencies - Structural conflict in writeback stage due to different latencies - Out-of-order write conflicts due to variable latencies How to handle exceptions? ECE 252 / CPS 220 8

Complex In-Order Pipeline PC Inst. Mem D Decode GPRs X1 + X2 Data Mem X3 W Delay writeback so all operations have same latency to writeback stage. FPRs X1 X2 Fadd X3 W Write ports never oversubscribed Every cycle has one instruction in and one instruction out How do we prevent increased writeback latency from slowing down single-cycle integer operations? Forwarding FDiv X2 Fmul X3 X2 Unpipelined divider X3 Commit Point ECE 252 / CPS 220 9

Complex In-Order Pipeline PC Inst. Mem D Decode GPRs X1 + X2 Data Mem X3 W How do we handle data hazards for very long latency operations? Stall pipeline on long latency operations (e.g., divides, cache misses) FPRs X1 X2 Fadd X3 X2 Fmul X3 W Commit Point Exceptions handled in program order at commit point FDiv X2 Unpipelined divider X3 ECE 252 / CPS 220 10

Superscalar In-Order Pipeline PC Inst. Mem 2 D Dual GPRs Decode X1 + X2 Data Mem X3 W Fetch 2 instructions per cycle. Issue both simultaneously if instruction mix matches functional unit mix. FPRs X1 X2 Fadd X3 W Increases instruction throughput. X2 Fmul X3 Commit Point How do we further increase issue width? (a) duplicate functional units, (b) increase register file ports, (c) increase forwarding paths FDiv X2 Unpipelined divider X3 ECE 252 / CPS 220 11

Dependence Analysis Consider executing a sequence of instructions of the form: Rk (Ri) op (Rj) Data Dependence R3 (R1) op (R4) R5 (R3) op (R4) Anti-dependence R3 (R1) op (R2) R1 (R4) op (R5) Output-dependence R3 (R1) op (R2) R3 (R6) op (R7) # RAW hazard (R3) # WAR hazard (R1) # WAW hazard (R3) ECE 252 / CPS 220 12

Detecting Data Hazards Range and Domain of Instruction (j) R(j) = registers (or other storage) modified by instruction j D(j) = registers (or other storage) read by instruction j Suppose instruction k follows instruction j in program order. Executing instruction k before the effect of instruction j has occurred can cause RAW hazard if R(j) D(k) # j modifies a register read by k WAR hazard if D(j) R(k) # j reads a register modified by k WAW hazard if R(j) R(k) # j, k modify the same register ECE 252 / CPS 220 13

Registers vs Memory Dependence Data hazards due to register operands can be determined at decode stage Data hazards due to memory operands can be determined only after computing effective address in execute stage store load M[R1 + disp1] R2 R3 M[R4 + disp2] (R1 + disp1) == (R4 + disp2)? ECE 252 / CPS 220 14

Data Hazards Example I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 RAW Hazards WAR Hazards WAW Hazards ECE 252 / CPS 220 15

Instruction Scheduling I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 Valid Instruction Orderings in-order I 1 I 2 I 3 I 4 I 5 I 6 I 1 I 2 I 3 I 4 I 5 out-of-order out-of-order I 2 I 1 I 3 I 4 I 5 I 6 I 1 I 2 I 3 I 5 I 4 I 6 I 6 ECE 252 / CPS 220 16

Out-of-Order Completion I 1 Latency I 1 DIVD f6, f6, f4 4 I 2 LD f2, 45(r3) 1 I 3 MULTD f0, f2, f4 3 I 4 DIVD f8, f6, f2 4 I 5 SUBD f10, f0, f6 1 I 6 ADDD f6, f8, f2 1 I 2 I 3 I 4 I 5 I 6 Let k indicate when instruction k is issued. Let k denote when instruction k is completed. ECE 252 / CPS 220 17

Out-of-Order Completion I 1 Latency I 1 DIVD f6, f6, f4 4 I 2 LD f2, 45(r3) 1 I 3 MULTD f0, f2, f4 3 I 4 DIVD f8, f6, f2 4 I 5 SUBD f10, f0, f6 1 I 6 ADDD f6, f8, f2 1 I 2 I 3 I 4 I 5 I 6 in-order comp 1 2 out-of-order comp 1 2 1 2 3 4 3 5 4 6 5 6 2 3 1 4 3 5 5 4 6 6 ECE 252 / CPS 220 18

Scoreboard Up until now, we assumed user or compiler statically examines instructions, detecting hazards and scheduling instructions Scoreboard is a hardware data structure to dynamically detect hazards ECE 252 / CPS 220 19

Cray CDC6600 Seymour Cray, 1963 - Fast, pipelined machine with 60-bit words - 128 Kword main memory capacity, 32-banks - Ten functional units (parallel, unpipelined) - Floating-point: adder, 2 multipliers, divider - Integer: adder, 2 incrementers - Dynamic instruction scheduling with scoreboard - Ten peripheral processors for I/O -More than 400K transistors, 750 sq-ft, 5 tons, 150kW with novel Freon-based cooling - Very fast clock, 10MHz (FP add in 4 clocks) - Fastest machine in world for 5 years - Over 100 sold ($7-10M each) ECE 252 / CPS 220 20

IBM Memo on CDC6600 Thomas Watson Jr., IBM CEO, August 1963 Last week, Control Data.announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership by letting someone else offer the world s most powerful computer. To which Cray replied It seems like Mr. Watson has answered his own question. ECE 252 / CPS 220 21

Multiple Functional Units ALU Mem IF ID Issue WB GPR s FPR s Previously, resolved write hazards (WAR, WAW) by equalizing pipeline depths and forwarding. Fadd Fmul Is there an alternative? Fdiv ECE 252 / CPS 220 22

Conditions for Instruction Issue When is it safe to issue an instruction? - Suppose a data structure tracks all instructions in all functional units Before issuing instruction, issue logic must check: - Is the required functional unit available? Check for structural hazard. - Is the input data available? Check for RAW hazard. - Is it safe to write the destination? Check for WAR, WAW hazard - Is there a structural hazard at the write back stage? ECE 252 / CPS 220 23

Issue Logic and Data Structure In issue stage, instruction j consults the table - Functional unit available? Check the busy column - RAW? Search the dest column for j s sources - WAR? Search the source column for j s destination - WAW? Search the dest column for j s destination Add entry if no hazard detected, instruction issues Remove entry when instruction writes back Name Busy Op Dest Src1 Src2 Int Mem Add1 Add2 Add3 Mult1 Mult2 Div ECE 252 / CPS 220 24

Simplifying the Data Structure Assume instructions issue in-order Assume issue logic does not dispatch instruction if it detects RAW hazard or busy functional unit Assume functional unit latches operands when the instruction is issued ECE 252 / CPS 220 25

Simplifying the Data Structure Can the dispatched instruction cause WAR hazard? - No, because operands are read at issue and instructions issue in-order No WAR Hazards - No need to track source-1 and source-2 Can the dispatched instruction cause WAW hazard? - Yes, because instructions may complete out-of-order Do not issue instruction in case of WAW hazard - In scoreboard, a register name occurs at most once in dest column ECE 252 / CPS 220 26

Scoreboard Busy[FU#]: a bit-vector to indicate functional unit availability (FU = Int, Add, Mutl, Div) WP[#regs]: a bit-vector to record the registers to which writes are pending - Bits are set to true by issue logic - Bits are set to false by writeback stage - Each functional unit s pipeline registers must carry dest field and a flag to indicate if it s valid: the (we, ws) pair Issue logic checks instruction (opcode, dest, src1, src2) against scoreboard (busy, wp) to dispatch - FU available? Busy[FU#] - RAW? WP[src1] or WP[src2] - WAR? Cannot arise - WAW? WP[dest] ECE 252 / CPS 220 27

Busy-Functional Units Status Int(1) Add(1) Mult(3) Div(4) WB Writes Pending (WP) t0 I 1 f6 f6 t1 I 2 f2 f6 f6, f2 t2 f6 f2 f6, f2 I 2 t3 I 3 f0 f6 f6, f0 t4 f0 f6 f6, f0 I 1 t5 I 4 f0 f8 f0, f8 t6 f8 f0 f0, f8 I 3 t7 I 5 f10 f8 f8, f10 t8 f8 f10 f8, f10 I 5 t9 f8 f8 I 4 t10 I 6 f6 f6 t11 f6 f6 I 6 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 Instruction Issue Logic FU available? Busy[FU#] RAW? WP[src1] or WP[src2] WAR? Cannot arise WAW? WP[dest] ECE 252 / CPS 220 28

Scoreboard Detect hazards dynamically Issue instructions in-order Complete instructions out-of-order Increases instruction-level-parallelism by - More effectively exploiting multiple functional units - Reducing the number of pipeline stalls due to hazards ECE 252 / CPS 220 29

Acknowledgements These slides contain material developed and copyright by - Arvind (MIT) - Krste Asanovic (MIT/UCB) - Joel Emer (Intel/MIT) - James Hoe (CMU) - John Kubiatowicz (UCB) - Alvin Lebeck (Duke) - David Patterson (UCB) - Daniel Sorin (Duke) ECE 252 / CPS 220 30