Lecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University

18-447 Lecture 8: Data Hazard and Resolution
James C. Hoe, Department of ECE, Carnegie Mellon University
18-447-S18-L08-S1, James C. Hoe, CMU/ECE/CALCM, 2018

Housekeeping
  Your goal today: detect and resolve data hazards in in-order pipelines
  Notices:
    Lab 2, status check next week, due wk of 2/26
    HW 2, due 2/21
    **Office Hours: M 11~12 and F 1:30~2:30**
  Readings: P&H Ch 4

Instruction Pipeline Reality
  Not identical tasks: coalescing instruction types into one "multi-function" pipe
    => external fragmentation (some idle stages)
  Not uniform suboperations: group or sub-divide steps into stages to minimize variance
    => internal fragmentation (some too-fast stages)
  Not independent tasks: dependency detection and resolution
    => next lecture(s)
  Even more messy if not RISC

Data Dependence
  Data dependence, Read-after-Write (RAW):
    r3 ← r1 op r2
    ...
    r5 ← r3 op r4
  Anti-dependence, Write-after-Read (WAR):
    r3 ← r1 op r2
    ...
    r1 ← r4 op r5
  Output dependence, Write-after-Write (WAW):
    r3 ← r1 op r2
    ...
    r3 ← r6 op r7
  Don't forget memory instructions
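The three register dependence types can be told apart mechanically from the destination and source registers of an older and a younger instruction. A minimal Python sketch, assuming a hypothetical `Inst` record with one destination and a tuple of sources (these names are illustrative and not from the lecture; memory dependences are not modeled):

```python
from typing import NamedTuple, Optional, Set, Tuple

class Inst(NamedTuple):
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read

def dependences(older: Inst, younger: Inst) -> Set[str]:
    """Return the register dependence types from `older` to `younger`."""
    found = set()
    if older.dest is not None and older.dest in younger.srcs:
        found.add("RAW")      # younger reads what older writes
    if younger.dest is not None and younger.dest in older.srcs:
        found.add("WAR")      # younger overwrites what older reads
    if older.dest is not None and older.dest == younger.dest:
        found.add("WAW")      # both write the same register
    return found

first = Inst("r3", ("r1", "r2"))                       # r3 <- r1 op r2
print(dependences(first, Inst("r5", ("r3", "r4"))))    # RAW
print(dependences(first, Inst("r1", ("r4", "r5"))))    # WAR
print(dependences(first, Inst("r3", ("r6", "r7"))))    # WAW
```

A pair of instructions can carry more than one dependence at once, which is why the function returns a set.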

RAW Dependency and Hazard
                  t0   t1   t2   t3   t4   t5
  addi ra, r_, _  IF   ID   EX   MEM  WB
  addi r_, ra, _       IF   ID   EX   MEM  WB
  addi r_, ra, _            IF   ID   EX   MEM
  addi r_, ra, _                 IF   ID   EX
  addi r_, ra, _                      IF   ID
  addi r_, ra, _                           IF

Register Data Hazard Analysis
        R/I-Type  LW        SW        Bxx       Jal       Jalr
  IF
  ID    read RF   read RF   read RF   read RF             read RF
  EX
  MEM
  WB    write RF  write RF                      write RF  write RF
  For a given pipeline, when is there a register data hazard between 2 dependent instructions?
    dependence type: RAW, WAR, WAW?
    instruction types involved?
    distance between the two instructions?

Hazard in In-order Pipeline
  Younger instruction j is in stage X while older instruction i is in stage Y:
    RAW Hazard: j: _ ← rk (RF Read in stage X),  i: rk ← _ (RF Write in stage Y)
    WAR Hazard: j: rk ← _ (RF Write in stage X), i: _ ← rk (RF Read in stage Y)
    WAW Hazard: j: rk ← _ (RF Write in stage X), i: rk ← _ (RF Write in stage Y)
  dist_dependence(i,j) ≤ dist_hazard(X,Y)  => Hazard!!
  dist_dependence(i,j) > dist_hazard(X,Y)  => Safe

RAW Hazard Analysis Example
        R/I-Type  LW        SW        Bxx       Jal       Jalr
  IF
  ID    read RF   read RF   read RF   read RF             read RF
  EX
  MEM
  WB    write RF  write RF                      write RF  write RF
  Older I_A and younger I_B have RAW hazard iff
    I_B (R/I, LW, SW, Bxx or JALR) reads a register written by I_A (R/I, LW, or JAL/R)
    dist(I_A, I_B) ≤ dist(ID, WB) = 3
  What about WAW and WAR hazard?
  What about memory data hazard?
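The distance test can be sketched in a few lines of Python. The function names below are hypothetical, and the stall count assumes a pipeline without forwarding in which the only delay is this one RAW pair:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def dist_hazard(read_stage: str, write_stage: str) -> int:
    """Number of stages separating the RF read from the RF write."""
    return STAGES.index(write_stage) - STAGES.index(read_stage)

def raw_hazard(dist_dependence: int, read_stage: str = "ID",
               write_stage: str = "WB") -> bool:
    """A dependent pair is hazardous iff it is no farther apart than the stages."""
    return dist_dependence <= dist_hazard(read_stage, write_stage)

def stall_cycles(dist_dependence: int, read_stage: str = "ID",
                 write_stage: str = "WB") -> int:
    """Bubbles needed so the reader ends up just past the hazard distance."""
    return max(0, dist_hazard(read_stage, write_stage) - dist_dependence + 1)

for d in range(1, 5):
    print(d, raw_hazard(d), stall_cycles(d))
```

For the 5-stage pipeline this reports a hazard for distances 1 through 3 and none at distance 4.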

Pipeline Stall: universal hazard resolution
  [Pipeline diagram, t0..t5: Inst h, i, j, k, l pass through IF, ID, ALU, MEM, WB. With i: rx ← _ and j: _ ← rx, bubbles separate i and j when dist(i,j) = 1, 2 or 3; no bubble is shown when dist(i,j) = 4.]
  Stall == make younger instruction wait until hazard passes
    1. stop all up-stream stages
    2. drain all down-stream stages

What should happen in this case?
  [Pipeline diagram, t0..t5: Inst h, i, j, k, l pass through IF, ID, ALU, MEM, WB.]
  i: rx ← _
  j: ry ← rz
  k: _ ← rx      dist(i,k)=2

Pipeline Stall
  i: rx ← _
  j: _ ← rx
        t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
  IF    i    j    k    k    k    k    l
  ID    h    i    j    j    j    j    k    l
  EX         h    i    bub  bub  bub  j    k    l
  MEM             h    i    bub  bub  bub  j    k    l
  WB                   h    i    bub  bub  bub  j    k    l
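The stall timeline above can be reproduced with a small cycle-by-cycle model. This is a minimal sketch under stated assumptions: the `Inst` record and `simulate` function are hypothetical names, registers are read in ID and written in WB with no forwarding, and a read in the same cycle as the write still counts as a hazard. The simulation starts one cycle earlier than the slide, with h in IF.

```python
from typing import List, NamedTuple, Optional, Tuple

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

class Inst(NamedTuple):
    name: str
    dest: Optional[str]       # register written in WB, or None
    srcs: Tuple[str, ...]     # registers read in ID

def simulate(program: List[Inst], num_cycles: int) -> List[List[Optional[str]]]:
    """Per cycle, the instruction name in each stage (None = empty or bubble)."""
    pipe: List[Optional[Inst]] = [program[0], None, None, None, None]
    next_fetch = 1
    history = []
    for _ in range(num_cycles):
        history.append([inst.name if inst is not None else None for inst in pipe])
        in_id = pipe[1]
        # RAW hazard: ID reads a register that an older instruction
        # in EX, MEM or WB has not finished writing back.
        stall = in_id is not None and any(
            older is not None and older.dest is not None
            and older.dest in in_id.srcs
            for older in pipe[2:]
        )
        if stall:
            # Hold IF and ID, inject a bubble into EX, let the rest drain.
            pipe = [pipe[0], pipe[1], None, pipe[2], pipe[3]]
        else:
            fetched = program[next_fetch] if next_fetch < len(program) else None
            next_fetch += 1
            pipe = [fetched, pipe[0], pipe[1], pipe[2], pipe[3]]
    return history

program = [
    Inst("h", None, ()),
    Inst("i", "rx", ()),          # i: rx <- _
    Inst("j", None, ("rx",)),     # j: _ <- rx
    Inst("k", None, ()),
    Inst("l", None, ()),
]
history = simulate(program, 12)
for s, stage in enumerate(STAGES):
    print(f"{stage:4}", " ".join(f"{(row[s] or '.'):3}" for row in history))
```

In the printed rows, j sits in ID for four consecutive cycles and three bubbles follow i through EX, MEM and WB.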

Stall
  [Figure: 5-stage pipelined datapath (PC, instruction memory, register file, ALU, data memory, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with a stall signal from Control.]
  Stall:
    disable PC and IR latching
    set RegWrite_ID = 0 and MemWrite_ID = 0
  Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Stall Condition
  Older I_A and younger I_B have RAW hazard iff
    I_B (R/I, LW, SW, Bxx or JALR) reads a register written by I_A (R/I, LW, or JAL/R)
    dist(I_A, I_B) ≤ dist(ID, WB) = 3
  More plainly, before I_B in ID reads a register, I_B needs to check if any I_A in EX, MEM or WB is going to update it (if so, value in RF is "stale")
  Watch out for x0!!

Stall Condition
  Helper functions
    use_rs1(I) returns true if I uses rs1 && rs1 != x0
  Stall IF and ID when
    (rs1_ID == rd_EX)  && use_rs1(IR_ID) && RegWrite_EX   or
    (rs1_ID == rd_MEM) && use_rs1(IR_ID) && RegWrite_MEM  or
    (rs1_ID == rd_WB)  && use_rs1(IR_ID) && RegWrite_WB   or
    (rs2_ID == rd_EX)  && use_rs2(IR_ID) && RegWrite_EX   or
    (rs2_ID == rd_MEM) && use_rs2(IR_ID) && RegWrite_MEM  or
    (rs2_ID == rd_WB)  && use_rs2(IR_ID) && RegWrite_WB
  It is crucial that EX, MEM and WB continue to advance during stall
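The six-term stall condition can be written compactly as a loop over the three older stages. The sketch below is only an illustration: `Decoded` and `Older` are hypothetical records standing in for the ID-stage instruction and the control bits carried in EX, MEM and WB.

```python
from typing import NamedTuple

class Decoded(NamedTuple):        # instruction currently in ID
    rs1: int
    rs2: int
    reads_rs1: bool               # does this instruction type read rs1?
    reads_rs2: bool               # does this instruction type read rs2?

class Older(NamedTuple):          # what an EX, MEM or WB stage carries
    rd: int
    reg_write: bool

def use_rs1(ir: Decoded) -> bool:
    return ir.reads_rs1 and ir.rs1 != 0     # x0 never causes a hazard

def use_rs2(ir: Decoded) -> bool:
    return ir.reads_rs2 and ir.rs2 != 0

def stall(ir_id: Decoded, ex: Older, mem: Older, wb: Older) -> bool:
    """True when IF and ID must hold because of a RAW hazard."""
    return any(
        older.reg_write and (
            (ir_id.rs1 == older.rd and use_rs1(ir_id)) or
            (ir_id.rs2 == older.rd and use_rs2(ir_id))
        )
        for older in (ex, mem, wb)
    )

idle = Older(rd=0, reg_write=False)
consumer = Decoded(rs1=5, rs2=0, reads_rs1=True, reads_rs2=False)
print(stall(consumer, Older(5, True), idle, idle))    # producer in EX
print(stall(consumer, idle, idle, idle))              # nothing pending
```

Folding the `!= 0` check into `use_rs1` and `use_rs2` keeps the x0 special case in one place.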

Impact of Stall on Performance
  Each stall cycle corresponds to 1 lost ALU cycle
  A program with N instructions and S stall cycles: average IPC = N/(N+S)
  S depends on
    frequency of hazard-causing dependencies
    distance between hazard-causing instruction pairs
    distance between hazard-causing dependencies (suppose i1, i2 and i3 all depend on i0; once i1's hazard is resolved by stalling, i2 and i3 do not stall)
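As a quick worked example of the IPC formula, here is a short Python sketch. The instruction and stall counts are made-up illustrative numbers, not measurements from the lecture:

```python
def average_ipc(n_instructions: int, stall_cycles: int) -> float:
    """Average IPC = N / (N + S) for an otherwise ideal scalar pipeline."""
    return n_instructions / (n_instructions + stall_cycles)

# Hypothetical program: 100 instructions that incur 25 stall cycles.
print(average_ipc(100, 25))
# With no stalls the pipeline sustains one instruction per cycle.
print(average_ipc(100, 0))
```

Pipeline fill and drain cycles are ignored here, which matches the formula only for long-running programs.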

Sample Assembly [P&H]
  for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) {... }

            addi $s1, $s0, -1
  for2tst:  slti $t0, $s1, 0          3 stalls
            bne  $t0, $zero, exit2    3 stalls
            sll  $t1, $s1, 2
            add  $t2, $a0, $t1        3 stalls
            lw   $t3, 0($t2)          3 stalls
            lw   $t4, 4($t2)
            slt  $t0, $t4, $t3        3 stalls
            beq  $t0, $zero, exit2    3 stalls
            ...
            addi $s1, $s1, -1
            j    for2tst
  exit2:

Data Forwarding (or Register Bypassing)
  What does "ADD rx ry rz" mean? Get inputs from RF[ry] and RF[rz] and put result in RF[rx]?
  But, RF is just a part of an abstraction
    a way to connect dataflow between instructions: "inputs to ADD are resulting values of the last instructions to assign to RF[ry] and RF[rz]"
    RF doesn't have to exist as a literal object
  If only dataflow matters, don't wait for WB...
    add  ra, r_, r_    IF   ID   EX   MEM  WB
    addi r_, ra, _          IF   ID   EX   MEM  WB

Resolving RAW Hazard by Forwarding
  A hazard exists: older I_A and younger I_B have RAW hazard iff
    I_B (R/I, LW, SW, Bxx or JALR) reads a register written by I_A (R/I, LW, or JAL/R)
    dist(I_A, I_B) ≤ dist(ID, WB) = 3
  More plainly, before I_B in ID reads a register, I_B needs to check if any I_A in EX, MEM or WB is going to update it (if so, value in RF is "stale")
  Before: I_B needs to stall for RF to update
  Now: I_B needs to
    stall for I_A to produce result
    retrieve I_A result from datapath when ready
    must retrieve from youngest if multiple hazards

Forwarding Paths (v1)
  [Figure: datapath with forwarding. A forwarding unit compares the Rs and Rt fields against ID/EX.RegisterRd, EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA and ForwardB muxes. Forwarding paths are labeled dist(i,j)=1, dist(i,j)=2 and dist(i,j)=3; the dist(i,j)=3 case is marked "Registers: internal forward?".]
  [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Forwarding Paths (v2)
  [Figure: datapath with forwarding. A forwarding unit compares the Rs and Rt fields against EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA and ForwardB muxes in front of the ALU. Forwarding paths are labeled dist(i,j)=1, dist(i,j)=2 and dist(i,j)=3.]
  better if EX is the fastest stage
  [Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Forwarding Logic (for v1)
  if      (rs1_ID != 0) && (rs1_ID == rd_EX)  && RegWrite_EX   then forward writeback value from EX    // dist=1
  else if (rs1_ID != 0) && (rs1_ID == rd_MEM) && RegWrite_MEM  then forward writeback value from MEM   // dist=2
  else if (rs1_ID != 0) && (rs1_ID == rd_WB)  && RegWrite_WB   then forward writeback value from WB    // dist=3
  else use A_ID                                                                                      // dist > 3
  Must check in right order
  Why doesn't use_rs1( ) appear?
  Isn't it bad to forward from LW in EX?
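The priority order of the forwarding checks can be illustrated with a small Python sketch. `StageOut` and `forward_rs1` are hypothetical names, and the sketch assumes a separate stall already keeps a load that is still in EX from being selected here:

```python
from typing import NamedTuple

class StageOut(NamedTuple):
    rd: int             # destination register of the instruction in this stage
    reg_write: bool     # will it write the register file?
    value: int          # its writeback value, as available in this stage

def forward_rs1(rs1_id: int, a_id: int,
                ex: StageOut, mem: StageOut, wb: StageOut) -> int:
    """Pick the rs1 operand: youngest matching producer first, else the RF value."""
    for stage in (ex, mem, wb):              # dist = 1, 2, 3, in that order
        if rs1_id != 0 and stage.reg_write and stage.rd == rs1_id:
            return stage.value
    return a_id                              # dist > 3: value read from RF

# Two older instructions both write x7; the one in EX is the most recent.
ex = StageOut(rd=7, reg_write=True, value=111)
mem = StageOut(rd=7, reg_write=True, value=222)
wb = StageOut(rd=3, reg_write=True, value=333)
print(forward_rs1(7, a_id=999, ex=ex, mem=mem, wb=wb))
```

Checking EX before MEM before WB is what makes the consumer see the most recent assignment when several in-flight instructions write the same register.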

Data Hazard Analysis (with Forwarding)
        R/I-Type      LW        SW        Bxx       Jal       Jalr
  IF
  ID
  EX    use, produce  use       use       use       produce   use, produce
  MEM                 produce   (use)
  WB
  Even with forwarding, RAW dependence on immediate preceding LW results in hazard
  Stall = { [(rs1_ID == rd_EX) && use_rs1(IR_ID)] || [(rs2_ID == rd_EX) && use_rs2(IR_ID)] } && MemRead_EX
  i.e., op_EX = Lx
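With forwarding in place, the only remaining stall is the load-use case. A minimal Python sketch of that condition follows; the `Decoded` record and `load_use_stall` function are hypothetical names, and `mem_read_ex` stands for the MemRead control bit of the instruction in EX:

```python
from typing import NamedTuple

class Decoded(NamedTuple):        # instruction currently in ID
    rs1: int
    rs2: int
    reads_rs1: bool
    reads_rs2: bool

def use_rs1(ir: Decoded) -> bool:
    return ir.reads_rs1 and ir.rs1 != 0

def use_rs2(ir: Decoded) -> bool:
    return ir.reads_rs2 and ir.rs2 != 0

def load_use_stall(ir_id: Decoded, rd_ex: int, mem_read_ex: bool) -> bool:
    """Stall one cycle when ID depends on a load that is still in EX."""
    depends = ((ir_id.rs1 == rd_ex and use_rs1(ir_id)) or
               (ir_id.rs2 == rd_ex and use_rs2(ir_id)))
    return depends and mem_read_ex

consumer = Decoded(rs1=4, rs2=9, reads_rs1=True, reads_rs2=True)
print(load_use_stall(consumer, rd_ex=4, mem_read_ex=True))    # load in EX
print(load_use_stall(consumer, rd_ex=4, mem_read_ex=False))   # ALU op in EX
```

Only the EX stage is examined because a load's data is available for forwarding once the load has moved past MEM.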

MIPS Load Delay Slot Feature
  I1: LW ra ← ...       IF   ID   EX   MEM  WB
  I2: addi r_, ra, _         IF   ID   EX   MEM  WB
  I3: addi r_, ra, _              IF   ID   EX   MEM  WB
  R2000 defined LW with arch. latency of 1 inst
    invalid for I2 (in LW's delay slot) to ask for LW's result
    any dependence on LW at least distance 2
  Delay slot vs dynamic stalling
    fill with an independent instruction (no difference)
    if not, fill with a NOP (no difference)
  Can't lose on 5-stage... good idea?
  Hint: 1. non-atomic instruction; 2. arch influence

Sample Assembly [P&H]
  for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) {... }

            addi $s1, $s0, -1
  for2tst:  slti $t0, $s1, 0
            bne  $t0, $zero, exit2
            sll  $t1, $s1, 2
            add  $t2, $a0, $t1
            lw   $t3, 0($t2)
            lw   $t4, 4($t2)
            slt  $t0, $t4, $t3        1 stall or 1 nop (MIPS)
            beq  $t0, $zero, exit2
            ...
            addi $s1, $s1, -1
            j    for2tst
  exit2:
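The stall counts for this sample can be checked with a small dependence-driven model. This is a sketch under stated assumptions rather than a full pipeline simulator: it covers only the straight-line instructions from the first `addi` through `beq`, ignores control hazards, and uses hypothetical names (`PROGRAM`, `count_stalls`). Without forwarding a consumer must be 4 cycles behind its producer; with forwarding the gap is 2 behind a load and 1 otherwise.

```python
# (opcode, destination register or None, source registers)
PROGRAM = [
    ("addi", "$s1", ("$s0",)),
    ("slti", "$t0", ("$s1",)),
    ("bne",  None,  ("$t0", "$zero")),
    ("sll",  "$t1", ("$s1",)),
    ("add",  "$t2", ("$a0", "$t1")),
    ("lw",   "$t3", ("$t2",)),
    ("lw",   "$t4", ("$t2",)),
    ("slt",  "$t0", ("$t4", "$t3")),
    ("beq",  None,  ("$t0", "$zero")),
]

def count_stalls(program, forwarding: bool) -> int:
    """Count RAW stall cycles for straight-line code in a 5-stage pipeline."""
    last_writer = {}          # register -> (cycle its writer left ID, is_load)
    prev_id = -1              # cycle in which the previous instruction left ID
    stalls = 0
    for op, dest, srcs in program:
        baseline = prev_id + 1            # ID cycle if nothing stalls
        earliest = baseline
        for reg in srcs:
            if reg == "$zero" or reg not in last_writer:
                continue
            cycle, is_load = last_writer[reg]
            if forwarding:
                gap = 2 if is_load else 1
            else:
                gap = 4                   # wait until after the producer's WB
            earliest = max(earliest, cycle + gap)
        stalls += earliest - baseline
        if dest is not None:
            last_writer[dest] = (earliest, op == "lw")
        prev_id = earliest
    return stalls

print(count_stalls(PROGRAM, forwarding=False))   # six pairs at 3 stalls each
print(count_stalls(PROGRAM, forwarding=True))    # only the load-use pair
```

Under these assumptions the model reports 18 stall cycles without forwarding and 1 with forwarding.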

Terminology
  Dependency: ordering requirement between instructions
  Pipeline Hazard: (potential) violation of dependencies
  Hazard Resolution:
    static => schedule instructions at compile time to avoid hazards
    dynamic => detect hazard and adjust pipeline operation (Stall, Flush or Forward)
  Pipeline Interlock (i.e., stall)

Dividing into Stages
  [Figure: single-cycle datapath divided into five stages, with stage delays:]
    IF: Instruction fetch                              200ps
    ID: Instruction decode / register file read        100ps
    EX: Execute / address calculation                  200ps
    MEM: Memory access                                 200ps
    WB: Write back (RF write; ignore for today)        100ps
  Is this the correct partitioning? Why not 4 or 6 stages? Why not different boundaries?
  Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Why not very deep pipelines?
  With only 5 stages, still plenty of combinational logic between registers
  "Superpipelining": increase pipelining such that even intrinsic operations (e.g. ALU, RF access, memory access) require multiple stages
  What's the problem?
    Inst0: r1 ← r2 + r3
    Inst1: r4 ← r1 + 2
  [Pipeline diagram, t0..t5: the 5-stage flow F D E M W for Inst0 and Inst1, compared with the superpipelined flow in which each stage is split in two: Fa Fb Da Db Ea Eb Ma Mb Wa Wb.]

Intel P4's Superpipelined Adder Hack
  [Figure: EX1 computes S_lower from A_lower and B_lower with a 16-bit add; EX2 computes S_upper from A_upper and B_upper with a 16-bit add.]
  32-bit addition pipelined over 2 stages, BW = 1/latency of a 16-bit add
  No stall between back-to-back dependencies
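The staggered addition can be mimicked in Python to see why the two halves compose into a correct 32-bit sum. This sketch only models the arithmetic (the function names `ex1`, `ex2` and `add32` are hypothetical) and assumes unsigned 32-bit operands with the carry out of the lower half passed from EX1 to EX2:

```python
MASK16 = 0xFFFF

def ex1(a: int, b: int):
    """Stage EX1: add the lower 16 bits, produce S_lower and a carry."""
    total = (a & MASK16) + (b & MASK16)
    return total & MASK16, total >> 16

def ex2(a: int, b: int, carry: int) -> int:
    """Stage EX2: add the upper 16 bits plus the carry from EX1."""
    return ((a >> 16) + (b >> 16) + carry) & MASK16

def add32(a: int, b: int) -> int:
    s_lower, carry = ex1(a, b)       # available after the first stage
    s_upper = ex2(a, b, carry)       # available one stage later
    return (s_upper << 16) | s_lower

print(hex(add32(0x0000FFFF, 0x00000001)))   # carry ripples into the upper half
print(hex(add32(0xFFFFFFFF, 0x00000001)))   # wraps around modulo 2**32
```

A dependent add can start its own EX1 as soon as the producer's S_lower exists, since its EX2 will not need S_upper until one stage later; that is the sense in which back-to-back dependent adds do not stall.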

When you can't split a stage...
  [Figure: input I arrives at rate 2/T and is distributed to two copies, A and B, of a stage with delay T, each operating at rate 1/T; their results are merged into output O at rate 2/T. The surrounding logic runs on a 0.5T clock.]

Dependencies and Pipelining (architecture vs. microarchitecture)
  Sequential and atomic instruction semantics
    [Figure: instructions i1, i2, i3 shown first as atomic, sequential steps and then broken into overlapping sub-operations.]
  True dependence between two instructions may only require ordering of certain sub-operations
  Defines what is correct; doesn't say do it this way