第三章 Instruction-Level Parallelism and Its Dynamic Exploitation. 陈文智 浙江大学计算机学院 2014 年 10 月

Similar documents
Pipeline Review. Review

Pipelining: Hazards Ver. Jan 14, 2014

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Computer Architecture

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

What is Pipelining? RISC remainder (our assumptions)

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Instruction Pipelining Review

Modern Computer Architecture

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Appendix A. Overview

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

MIPS An ISA for Pipelining

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

Advanced Computer Architecture Pipelining

COSC 6385 Computer Architecture - Pipelining

DLX Unpipelined Implementation

Chapter 4 The Processor 1. Chapter 4A. The Processor

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

Pipelining. Maurizio Palesi

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Chapter 4. The Processor

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Appendix C. Abdullah Muzahid CS 5513

COMPUTER ORGANIZATION AND DESIGN

Lecture 2: Processor and Pipelining 1

Processor (II) - pipelining. Hwansoo Han

Pipelining: Basic and Intermediate Concepts

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Chapter 4. The Processor

Appendix C: Pipelining: Basic and Intermediate Concepts

COMPUTER ORGANIZATION AND DESIGN

Full Datapath. Chapter 4 The Processor 2

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

mywbut.com Pipelining

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

ECEC 355: Pipelining

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

C.1 Introduction. What Is Pipelining? C-2 Appendix C Pipelining: Basic and Intermediate Concepts

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

HY425 Lecture 05: Branch Prediction

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

LECTURE 3: THE PROCESSOR

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Computer Architecture Spring 2016

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Advanced Computer Architecture

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Full Datapath. Chapter 4 The Processor 2

Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

Lecture 5: Pipelining Basics

1 Hazards COMP2611 Fall 2015 Pipelined Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review

Lecture Topics. Announcements. Today: Data and Control Hazards (P&H ) Next: continued. Exam #1 returned. Milestone #5 (due 2/27)

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Pipelining. CSC Friday, November 6, 2015

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CSEE 3827: Fundamentals of Computer Systems

CSE 502 Graduate Computer Architecture. Lec 3-5 Performance + Instruction Pipelining Review

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Chapter 4 (Part II) Sequential Laundry

ECE 154A Introduction to. Fall 2012

ECE232: Hardware Organization and Design

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

ECE331: Hardware Organization and Design

Overview of Pipelining

Processor Architecture

CSE 502 Graduate Computer Architecture. Lec 4-6 Performance + Instruction Pipelining Review

ELE 655 Microprocessor System Design

Lecture 7 Pipelining. Peng Liu.

Outline Marquette University

CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

Advanced Computer Architecture

ECS 154B Computer Architecture II Spring 2009

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4)

Transcription:

第三章 Instruction-Level Parallelism and Its Dynamic Exploitation 陈文智 chenwz@zju.edu.cn 浙江大学计算机学院 2014 年 10 月 1

3.3 The Major Hurdle of Pipelining Pipeline Hazards 本科回顾 ------- Appendix A.2 3.3.1 Taxonomy of hazard 3.3.2 Performance of pipeline with Hazard 3.3.3 Structural hazard 3.3.4 Data Hazards 3.3.5 Control Hazards 2

3.3.1 Taxonomy of hazard A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Structural hazards These are conflicts over hardware resources. Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 3

Hazards can always be resolved by Stall The simplest way to "fix" hazards is to stall the pipeline. Stall means suspending the pipeline for some instructions by one or more clock cycles. The stall delays all instructions issued after the instruction that was stalled, while other instructions in the pipeline go on proceeding. A pipeline stall is also called a pipeline bubble or simply bubble. No new instructions are fetched during a stall. 4

3.3.2 Performance of pipeline with stalls Recall the speedup formula: 5

Assumptions for calculation The ideal CPI on a pipelined processor is almost always 1. So Ignore the overhead of pipelining clock cycle. Pipe stages are ideal balanced. 6

Clock cycle unpipelined = Clock cycle pipelining CPl unpipelined = pipeline depth 7

3.3.3 Structural hazard Structural hazards Occurs when two or more instructions want to use the same hardware resource in the same cycle Causes bubble (stall) in pipelined machines Overcome by replicating hardware resources Multiple accesses to the register file Multiple accesses to memory some functional unit is not pipelined. Not fully pipelined functional units 8

一 Multi access to the register file Simply insert a stall, speedup will be decreased. We have resolved it with double bump 9

Double Bump Works! 10

二 Multi access to Single Memory Time (clock cycles) I n s t r. Ld/St Instr 1 Mem Reg Mem Reg Mem Reg Mem Reg O r d e r Instr 2 Instr 3 Insert stall Mem Reg Mem Reg Mem provide another memory port split instruction memory and data memory use instruction buffer Reg Mem Reg 11

Insert Stall Time (clock cycles) I n s t r. O r d e r Ld/St Instr 1 Instr 2 Stall Instr 3 Mem Reg Mem Reg Mem Reg Mem Reg Mem Reg Mem Reg Bubble Bubble Bubble Bubble Bubble Mem Reg Mem Reg 12

Split instruction and data memory Time (clock cycles) I n s t r. Ld/St Instr 1 IM Reg DM Reg IM Reg DM Reg O r d e r Instr 2 Instr 3 IM Reg DM Reg IM Reg DM Reg Split instruction and data memory / multiple memory port / instruction buffer means: fetch the instruction and data inference using different hardware resources. 13

三 Not fully pipelined function unit Unpipelined Float Adder ADDD IF ID ADDD WB ADDD IF ID stall stall stall stall stall ADDD Not fully pipelined Adder ADDD IF ID A1 A2 A3 WB ADDD IF ID stall A1 A2 A3 Fully pipelined Adder ADDD IF ID A1 A2 A3 A4 A5 A6 WB ADDD IF ID A1 A2 A3 A4 A5 A6 WB Or multiple unpipelined Float Adder ADDD IF ID ADDD1 WB ADDD IF ID ADDD2 WB 14

四 Why allow machine with structural hazard? To reduce cost. i.e. adding split caches, requires twice the memory bandwidth. also fully pipelined floating point units costs lots of gates. It is not worth the cost if the hazard does not occur very often. To reduce latency of the unit. Making functional units pipelined adds delay (pipeline overhead -> registers.) An unpipelined version may require fewer clocks per operation. Reducing latency has other performance benefits, as we will see. 15

Example: impact of structural hazard to performance Example Many machines have unpipelined float-point multiplier. The function unit time of FP multiplier is 6 clock cycles FP multiply has a frequency of 14% in a SPECfp benchmark Will the structural hzard have a large performance impact on the SPECfp benchmark? 16

Answer to the example In the best case: FP multiplies are distributed uniformly. There is one multiply in every 7 clock. 1/(14%) Then there will be no structural hazard,then there is no performance penalty at all. In the worst case: the multiplies are all clustered with no intervening instructions. Then every multiply instruction have to stall 5 clock cycles to wait for the multiplier be released. The CPI will increase 70% to 1.7, if the ideal CPI is 1. Experiment result: This structural hazard increase execution time by less than 3%. 17

3.3.4 Pipelining Data Hazards Taxonomy of Hazards Structural hazards These are conflicts over hardware resources. Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 18

一 Data hazard Data hazards occur when the pipeline changes the order of read/write accesses to operands comparing with that in sequential executing. Let s see an Example DADD R1, R1, R3 DSUB R4, R1, R5 AND R6, R1, R7 OR XOR R8, R1, R9 R10, R1, R11 19

Data hazard Basic structure An instruction in flight wants to use a data value that s not done yet Done means it s been computed and it s located where I would normally expect to go look in the pipe hardware to find it Basic cause You are used to assuming a purely sequential model of instruction execution Instruction N finishes before instruction N+k, for k >= 1 There are dependencies now between nearby instructions ( near in sequential order of fetch from memory) Consequence Data hazards -- instructions want data values that are not done yet, or in the right place yet 20

Coping with data hazards:example Time ( clock cycle) I n s t r. ADD R1,R2,R3 SUB R4, R1, R5 IM Reg IM R1, read DM R1 w DM Reg. O r d e r AND R6,R1,R7 OR R8,R1,R9 XOR R10,R1,R11 No Hazrd IM R1, read IM R1, read DM IM R1, read 21

二 Somecases Double Bump can do! Time ( clock cycle) I n s t r. ADD R1,R2,R3 SUB R4, R1, R5 IM Reg IM R1, read DM R1 w DM Reg. O r d e r AND R6,R1,R7 OR R8,R1,R9 double bump can do! XOR R10,R1,R11 No Hazard IM R1, read IM R1, read DM IM R1, read 22

三 Proposed solution STALL Proposed solution Don t let them overlap like this? Mechanics Don t let the instruction flow through the pipe In particular, don t let it WRITE any bits anywhere in the pipe hardware that represents REAL CPU state (e.g., register file, memory) Let the instruction wait until the hazard resolved. Name for this operation: PIPELINE STALL 23

How do we stall? Insert nop by compiler Time ( clock cycle) I n s t r. ADD R1,R2,R3 NOP IM Reg DM R1 w Bubble Bubble Bubble Bubble Bubble. O r d e r NOP (ADD R0, R0, R0) SUB R4,R1,R5 double bump can do! AND R6, R1,R7 No Hazard Bubble Bubble Bubble Bubble IM R1, read IM R1, read 24

How do we stall? Add hardware Interlock! Add extra hardware to detect stall situations Watches the instruction field bits Looks for read versus write conflicts in particular pipe stages Basically, a bunch of careful case logic Add extra hardware to push bubbles thru pipe Actually, relatively easy Can just let the instruction you want to stall GO FORWARD through the pipe but, TURN OFF the bits that allow any results to get written into the machine state So, the instruction executes (it does the work), but doesn t save 25

Interlock: insert stalls Time ( clock cycle) I n s t r.. O r d e r ADD R1,R2,R3 DSUB, R4, R1,R5 AND R6,R1,R7 IM Reg IM No Hazard Bubble DM Bubble Empty slots in the pipe called bubbles; means no real instruction work getting saved here R1 w R1, read IM R1, read 26

Detect: Data Hazard Logic Rs =? Rd Rt =? Rd between IF/ID and ID/EX, EX/MEM Stages Rs Rt Rd Rd Rd 27

四 Forwarding If the result you need does not exist AT ALL yet, you are out of luck, sorry. But, what if the result exists, but is not stored back yet? Instead of stalling until the result is stored back in its natural home grab the result on the fly from inside the pipe, and send it to the other instruction (another pipe stage) that wants to use it 28

Forwarding Generic name: forwarding ( bypass, shortcircuiting) Instead of waiting to store the result, we forward it immediately (more or less) to the instruction that wants it Mechanically, we add buses to the datapath to move these values around, and these buses always point backwards in the datapath, from later stages to earlier stages 29

Forwarding: reduce data hazard stalls Data may be already computed - just not in the Register File Time ( clock cycle) I n s t r.. O r d e r ADD R1,R2,R3 SUB R4, R1, R5 AND R6,R1,R7 IM Reg IM R1, read IM R1 DM R1, read R1 R1 w DM Reg DM EX/MEM.output input port MEM/WB.output input port 30

Hardware Change for Forwarding NextPC Registers Immediate ID/EX mux mux EX/MEM Data Memory MEM/WR mux EX/Mem.output input MEM/WB.output input MEM/WB.LMD input 31

Forwarding path to other input entry store MEM/WB.LMD DM input load 32

33

34

Forwarding isn t Always availability 35

So we have to insert stall: Load stall Time (clock cycles) I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 Ifetch Reg Ifetch Reg Ifetch DMem Bubble Bubble Bubble Reg Reg Ifetch DMem Reg Reg DMem Reg DMem 36

Solution (without forwarding) 37

Solution (with forwarding) 38

The performance influence of load stall Example Assume 30% of the instructions are loads. Half the time, instruction following a load instruction depends on the result of the load. If hazard causes a single cycle delay, how much faster is the ideal pipeline? Answer CPI = 1+30% 50% 1=1.15 The performance decrease about 15% due to load stall. 39

Fraction of loads that cause a stall Fraction of load that cause a stall 45% 40% 41% 35% 30% 25% 24% 23% 24% 20% 20% 20% 15% 10% 5% 12% 10% 10% 4% 0% 40

五 Instruction reordering by compiler to avoid load stall Try producing fast code for a = b + c; d = e f; assuming a, b, c, d,e, and f in memory. Slow code: LW LW ADD SW LW LW SUB SW Rb,b Rc,c Ra,Rb,Rc a,ra Re,e Rf,f Rd,Re,Rf d,rd Fast code: LW LW LW ADD LW SW SUB SW Rb,b Rc,c Re,e Ra,Rb,Rc Rf,f a,ra Rd,Re,Rf d,rd 41

3.3.5 Pipelining Control Hazards Taxonomy of Hazards Structural hazards These are conflicts over hardware resources. Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet OK, we did these, Double Bump, Forwarding path, software scheduling, otherwise have to stall Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 42