Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park, Dept. of EE, KAIST


Outline
- Instruction-level parallelism: loop unrolling
- Dependences: data / name / control dependence; loop-level parallelism
- Dynamic scheduling: scoreboarding / Tomasulo approach
- Hardware branch prediction: branch prediction buffer / branch target buffer
- Multiple issue: superscalar / VLIW
- Compiler support for ILP: software pipelining / trace scheduling
- Hardware support for parallelism: conditional instructions, poison bits / boosting, Tomasulo + reorder buffer
- Studies of ILP
- A real example: the PowerPC 620

ILP: parallelism among instruction sequences.
- How to exploit it: pipelining and multiple-issue processors
- How much parallelism is available: limited by the number of instructions in a basic block, since branch frequency is about 15~20% (roughly one branch every 5 to 7 instructions)
- To find more, exploit ILP across multiple basic blocks
- Loop-level parallelism: loop unrolling, either statically or dynamically; vector instructions on a vector processor

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

Loop:  LD    F0, 0(R1)    ; F0 = array element
       ADDD  F4, F0, F2   ; add scalar in F2
       SD    0(R1), F4    ; store result
       SUBI  R1, R1, #8   ; decrement pointer (8 bytes per DW)
       BNEZ  R1, Loop     ; branch if R1 != zero

With stalls, each iteration takes 9 cycles:

                              clock cycle issued
Loop:  LD    F0, 0(R1)        1
       stall                  2
       ADDD  F4, F0, F2       3
       stall                  4
       stall                  5
       SD    0(R1), F4        6
       SUBI  R1, R1, #8       7
       BNEZ  R1, Loop         8
       stall                  9

After scheduling, 6 cycles:

Loop:  LD    F0, 0(R1)
       stall
       ADDD  F4, F0, F2
       SUBI  R1, R1, #8
       BNEZ  R1, Loop         ; delayed branch
       SD    8(R1), F4        ; altered and interchanged with SUBI

Unrolling 4 iterations:

Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4      ; drop SUBI & BNEZ
       LD    F6, -8(R1)
       ADDD  F8, F6, F2
       SD    -8(R1), F8     ; drop SUBI & BNEZ
       LD    F10, -16(R1)
       ADDD  F12, F10, F2
       SD    -16(R1), F12   ; drop SUBI & BNEZ
       LD    F14, -24(R1)
       ADDD  F16, F14, F2
       SD    -24(R1), F16
       SUBI  R1, R1, #32
       BNEZ  R1, Loop

After scheduling:

Loop:  LD    F0, 0(R1)
       LD    F6, -8(R1)
       LD    F10, -16(R1)
       LD    F14, -24(R1)
       ADDD  F4, F0, F2
       ADDD  F8, F6, F2
       ADDD  F12, F10, F2
       ADDD  F16, F14, F2
       SD    0(R1), F4
       SD    -8(R1), F8
       SD    -16(R1), F12
       SUBI  R1, R1, #32
       BNEZ  R1, Loop
       SD    8(R1), F16     ; 8 - 32 = -24

Loop unrolling reduces loop overhead:
- Eliminates branches and improves scheduling
- Allows instructions from different iterations to be scheduled together, exposing more computation that can be scheduled to minimize stalls
- Requires different registers for each iteration, which increases register pressure
- When the upper bound n of the loop is not known at compile time, generate k copies of the loop body and execute (n mod k) iterations of the original body followed by (n / k) trips through the unrolled body, as sketched below
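A minimal C sketch of this strip-mining idea, assuming k = 4 and the scalar-add loop of the running example (the function name and signature are illustrative):

/* Strip mining with k = 4: the first loop runs the n mod k leftover
   iterations; the second runs n / k trips of the 4x-unrolled body. */
void add_scalar(double *x, double s, int n) {
    int i = 0;
    int rem = n % 4;
    for (; i < rem; i++)          /* (n mod k) original iterations */
        x[i] = x[i] + s;
    for (; i < n; i += 4) {       /* (n / k) unrolled trips */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}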

Data dependence (RAW)
- Data-dependent instructions cannot execute simultaneously; whether a dependence actually causes a hazard stall is a property of the pipeline
- To avoid stalls, either:
  - Maintain the dependence but avoid the hazard: data forwarding, or scheduling (code rearrangement)
  - Eliminate the dependence by transforming the code
- Data dependence that flows through memory is difficult to detect

Name dependence (WAR, WAW)
- No value is transmitted between the instructions
- Two types: anti-dependence (WAR) and output dependence (WAW)
- Register renaming, done either by the compiler or in hardware, eliminates name dependences so that only true dependences remain (see the example below)

Control dependence
- Every instruction except those in the first basic block is control dependent on some set of branches
- An instruction that is dependent on a branch cannot be moved before or after the branch
- Delayed branches, speculation, and loop unrolling can relax this constraint
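A small C illustration with hypothetical variable names: reusing t creates WAR and WAW name dependences between the two statement pairs; renaming leaves only the true RAW dependences.

/* Reusing t creates name dependences between the two pairs. */
void with_reuse(int a, int b, int d, int e, int *c, int *f) {
    int t;
    t  = a + b;    /* writes t                          */
    *c = t * 2;    /* reads t: true (RAW) dependence    */
    t  = d + e;    /* WAW with stmt 1, WAR with stmt 2  */
    *f = t * 3;
}

/* After renaming, the two pairs are independent and could
   execute in parallel; only RAW dependences remain. */
void with_renaming(int a, int b, int d, int e, int *c, int *f) {
    int t1 = a + b;
    *c = t1 * 2;
    int t2 = d + e;
    *f = t2 * 3;
}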

Loop-carried dependence: a dependence exists between different iterations.

for (i = 1; i <= 100; i++)
    x[i] = x[i] + s;        /* no loop-carried dependence: the loop is parallel */

for (i = 1; i <= 100; i++) {
    a[i]   = a[i] + b[i];
    b[i+1] = c[i] + d[i];   /* b: loop-carried dependence */
}

The dependence on b is not circular, so the loop can still be transformed into a parallel one:

a[1] = a[1] + b[1];
for (i = 1; i <= 99; i++) {
    b[i+1] = c[i] + d[i];
    a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];

- Static scheduling: compiler techniques to reduce hazards and stalls
- Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls
  - Out-of-order execution, out-of-order completion
  - Instructions begin execution as soon as their operands are available
  - Can make exceptions imprecise
  - The ID stage is split into two stages:
    - Issue: decode instructions, check for structural hazards
    - Read operands: wait until no data hazards remain, then read operands
  - Two techniques: scoreboarding and the Tomasulo approach

Scoreboarding, used in the CDC 6600, aims to execute each instruction as early as possible:
- Instructions can be issued and executed as long as they do not depend on any active or stalled instruction
- Gave a performance improvement of 1.7x for FORTRAN programs and 2.5x for hand-coded assembly programs
The scoreboard:
- Centralizes all hazard detection and resolution
- Constructs a record of data dependences
- Monitors every change in the hardware
- Determines when an instruction can read its operands and begin execution
- Controls when an instruction can write its result into the destination

The scoreboard pipeline steps:
- Issue: check for structural and WAW hazards; if any exist, issue stalls
- Read operands: read the source operands when available and begin execution; this resolves RAW hazards dynamically (there is no forwarding of data)
- Execute: notify the scoreboard on completion
- Write result: check for WAR hazards, and stall the instruction if necessary

What limits the performance of a scoreboard:
- The amount of parallelism available among the instructions (parallelism within a basic block)
- The scoreboard size, which determines the instruction window: the set of instructions examined as candidates for potential execution
- The number and types of functional units
- The presence of anti-dependences and output dependences

The Tomasulo approach, used in the IBM 360/91 floating-point unit [1967]:
- Scoreboarding + register renaming: eliminates WAW and WAR hazards by renaming registers
- Reservation stations:
  - Located in front of each functional unit to hold issued instructions
  - Hazard detection and execution control are distributed
  - Fetch and buffer an operand as soon as it becomes available
- A result is passed directly to the reservation stations waiting for it over the common data bus (CDB)

- Issue: get an instruction from the instruction queue and issue it if there is an empty reservation station, sending the operands if they are in registers, or otherwise a tag denoting the reservation station that will produce each operand; if no reservation station is empty, stall until one is free
- Execute: if an operand is not yet available, monitor the CDB while waiting for it to be computed; when the operand becomes available, place it into the reservation station; when both operands are available, execute the operation
- Write result: write the result onto the CDB, into the registers, and into any reservation stations waiting for it
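A minimal C sketch of the reservation-station bookkeeping described above; field and function names are illustrative, not taken from the 360/91:

#include <stdbool.h>

/* One reservation station entry. Qj/Qk name the stations that will
   produce each source operand (0 = value already present in Vj/Vk). */
typedef struct {
    bool   busy;
    int    op;        /* operation to perform           */
    double Vj, Vk;    /* source values, when available  */
    int    Qj, Qk;    /* producing stations, 0 if ready */
} RS;

/* Broadcast a result on the CDB: every station waiting on
   station `tag` captures the value and marks that operand ready. */
void cdb_broadcast(RS *rs, int n, int tag, double value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = 0; }
        if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = 0; }
    }
}

/* An instruction may begin execution once both operands are ready. */
bool ready_to_execute(const RS *r) {
    return r->busy && r->Qj == 0 && r->Qk == 0;
}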

Dynamic branch prediction is done at run time:
- The prediction changes if the branch changes its behavior while the program is running
- Predicts both the branch direction and the branch target address
- Usually implemented with branch prediction buffers

Branch prediction buffer:
- A small memory indexed by the low-order bits of the branch instruction's address; branches whose addresses share the same low-order bits alias to the same entry
- Accessed with the instruction address during the IF stage
- Contains the previous branch history
- A one-bit history mispredicts twice for every change in direction:
    Actual:   T T T T T T N T T T
    Predict:  ? T T T T T T N T T
- An n-bit saturating up/down counter fixes this; a 2-bit counter is enough for almost all applications
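A small C sketch of the standard 2-bit saturating counter update (names illustrative):

/* 2-bit saturating counter states:
   0 = strong not-taken, 1 = weak not-taken,
   2 = weak taken,       3 = strong taken.   */
typedef unsigned char Counter2;

int predict_taken(Counter2 c) { return c >= 2; }

Counter2 update(Counter2 c, int taken) {
    if (taken) return c < 3 ? c + 1 : 3;   /* saturate at 3 */
    else       return c > 0 ? c - 1 : 0;   /* saturate at 0 */
}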

Observations:
- Integer programs have a higher branch frequency but lower prediction accuracy than FP programs
- The buffer hit rate is not the limiting factor
- Increasing the number of bits per predictor has little impact

Correlating predictor (two-level predictor): the prediction for the current branch is heavily affected by the direction of the last few branches, as in

if (d == 0) d = 1;
if (d == 1) ...

The same example in assembly:

     if (d == 0) d = 1;
     if (d == 1) ...

     BNEZ R1, L1       ; branch b1 (taken if d != 0)
     ADDI R1, R0, #1   ; d == 0, so d = 1
L1:  SUBI R3, R1, #1
     BNEZ R3, L2       ; branch b2 (taken if d != 1)
L2:  ...

If b1 is not taken, then d becomes 1 and b2 cannot be taken: b2 is perfectly predictable from the outcome of b1, which is exactly what a correlating predictor exploits.

Two-level predictors index the prediction table with the branch address combined with the direction history of the last branches:
- (m, n) predictor:
  - m: the outcomes of the last m branches are used, usually m > 10; the direction history is called a pattern
  - n: each entry is an n-bit counter, usually n = 2
- A purely pattern-based predictor, which does not use the branch address to index the table at all, already produces quite good results
- gshare (widely used): the table is indexed with (branch address XOR history pattern), as sketched below
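A hedged C sketch of a gshare predictor, assuming 12 bits of global history and a 4096-entry table of 2-bit counters (sizes illustrative):

#define HIST_BITS  12
#define TABLE_SIZE (1u << HIST_BITS)

static unsigned char table[TABLE_SIZE]; /* 2-bit counters               */
static unsigned history;                /* last HIST_BITS outcomes      */

unsigned gshare_index(unsigned pc) {
    return (pc ^ history) & (TABLE_SIZE - 1);   /* address XOR pattern */
}

int gshare_predict(unsigned pc) {
    return table[gshare_index(pc)] >= 2;        /* taken if counter >= 2 */
}

void gshare_update(unsigned pc, int taken) {
    unsigned i = gshare_index(pc);
    if (taken  && table[i] < 3) table[i]++;     /* saturating update */
    if (!taken && table[i] > 0) table[i]--;
    history = ((history << 1) | (taken & 1)) & (TABLE_SIZE - 1);
}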

Branch target buffer (BTB):
- A cache that stores the predicted address of the next instruction after a branch
- Accessed during the IF stage, i.e., before the instruction is decoded, so the BTB itself must reveal that the fetched instruction is a branch
- Stores only branches that are predicted taken, so a hit also tells us the fetched instruction is a taken branch
- With a 2-bit counter predictor, use both a target buffer and a separate prediction buffer
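A minimal sketch of a direct-mapped BTB in C that stores only predicted-taken branches (sizes and names illustrative):

#define BTB_ENTRIES 256

typedef struct {
    unsigned tag;     /* full PC of the branch    */
    unsigned target;  /* predicted target address */
    int      valid;
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

/* During IF: returns 1 and sets *target if the PC hits in the BTB,
   meaning "this is a branch we predict taken". */
int btb_lookup(unsigned pc, unsigned *target) {
    BTBEntry *e = &btb[pc % BTB_ENTRIES];
    if (e->valid && e->tag == pc) { *target = e->target; return 1; }
    return 0;
}

/* After resolution: insert taken branches, evict not-taken ones. */
void btb_update(unsigned pc, unsigned target, int taken) {
    BTBEntry *e = &btb[pc % BTB_ENTRIES];
    if (taken) { e->valid = 1; e->tag = pc; e->target = target; }
    else if (e->valid && e->tag == pc) e->valid = 0;
}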

Variations and refinements:
- Store one or more target instructions instead of the target PC
- Branch folding: if the only function of the branch is to change the PC, the branch can be eliminated entirely
- Return stack:
  - The accuracy of BTB prediction for the target of a return instruction is low: an indirect jump's destination varies at run time, and the majority of indirect jumps are procedure returns
  - A small stack that caches the most recent return addresses predicts them well (see the sketch below)
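A minimal return-address-stack sketch in C (the depth is illustrative):

#define RAS_DEPTH 16

static unsigned ras[RAS_DEPTH];
static int ras_top = 0;   /* number of valid entries */

void ras_push(unsigned return_addr) {   /* on a call */
    if (ras_top < RAS_DEPTH) ras[ras_top++] = return_addr;
    /* a real design would overwrite the oldest entry instead */
}

unsigned ras_pop(void) {                /* on a return: predicted target */
    return ras_top > 0 ? ras[--ras_top] : 0;
}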

Superscalar:
- Issues a varying number of independent instructions per clock, subject to hardware constraints
- Simplest organization: one integer + one floating-point instruction per cycle
- Needs more ports in the register file
- More difficult in a CISC, whose instruction lengths are variable
- The effect of stalls or delays is more severe than in scalar machines
- Scheduled either statically by the compiler or dynamically by hardware, issuing two or more instructions to reservation stations
- Pipeline the issue stage so that it runs two or more times faster
- For load/store instructions out-of-order execution is undesirable, so they go through a queue (cf. decoupled architectures)

VLIW (Very Long Instruction Word):
- Issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet
- Because it is extremely difficult to determine in hardware whether multiple instructions are independent, an efficient compiler is essential for building long sequences of instructions that can execute in parallel
- Limitations:
  - Requires a large number of functional units
  - Requires a large number of ports on the register file and memory, and large memory bandwidth
  - Instruction slots are often not full, wasting instruction bits and functional units
  - No binary code compatibility across implementations

Finding parallel loops:
- Recurrence: a variable is defined based on the value it had in an earlier iteration
- Dependence distance: the larger the distance, the more potential parallelism
- Array indices: assume the indices are affine, i.e., of the form a*i + b
- GCD test: for two affine indices a*i + b and c*j + d, there is no dependence if GCD(a, c) does not divide (d - b); see the sketch below
- Renaming
- Dependence analysis is challenging: pointers, indirect indexing, and lack of run-time information (a dependence may be possible yet never encountered at run time)
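A hedged C sketch of the GCD test, assuming the coefficients a and c are not both zero (function names are illustrative):

int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* Tests references x[a*i + b] and x[c*j + d].
   Returns 1 if a dependence is POSSIBLE. The test is conservative:
   it may report a dependence that never occurs for the actual bounds. */
int gcd_test(int a, int b, int c, int d) {
    int g = gcd(a, c);            /* assumes a, c not both zero */
    return (d - b) % g == 0;
}

For example, gcd_test(2, 3, 2, 0) returns 0: gcd(2, 2) = 2 does not divide 0 - 3, so x[2*i + 3] and x[2*i] can never refer to the same element.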

Software pipelining reorganizes loops so that each iteration of the software-pipelined code is made from instructions chosen from different iterations of the original loop:
- Interleaves instructions from different iterations without unrolling the loop
- Needs prolog (start-up) code and epilog (clean-up) code, as in the sketch below
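A C rendering of the idea for the running loop x[i] = x[i] + s, assuming n >= 2 (function name illustrative). Each steady-state iteration stores the result of iteration i, adds for iteration i+1, and loads for iteration i+2; the code before the loop is the prolog and the code after it is the epilog.

void add_scalar_swp(double *x, double s, int n) {
    double sum  = x[0] + s;   /* prolog: load + add for iteration 0 */
    double next = x[1];       /* prolog: load for iteration 1       */
    for (int i = 0; i + 2 < n; i++) {
        x[i] = sum;           /* store for iteration i   */
        sum  = next + s;      /* add   for iteration i+1 */
        next = x[i + 2];      /* load  for iteration i+2 */
    }
    x[n - 2] = sum;           /* epilog: store iteration n-2       */
    x[n - 1] = next + s;      /* epilog: add + store iteration n-1 */
}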

Trace scheduling finds parallelism across branches to make the frequent paths run fast:
- Trace selection: choose a likely sequence of basic blocks
- Trace compaction: global code scheduling of the selected trace; code motion across branches amounts to speculative execution
- Compensation (book-keeping) code is often inserted into the off-trace code to ensure correctness; one example of compensation code is an inverse operation

Conditional instructions: if the condition is true, the instruction is executed normally; otherwise the execution continues as if it were a NOP. Conditional move instructions are employed in recent processors.
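A small illustration in C: a branchy select and its branchless form, which a compiler can lower to a conditional move:

/* Branch version: control dependence on (a < b). */
int min_branch(int a, int b) {
    int x;
    if (a < b) x = a; else x = b;
    return x;
}

/* Branchless version: the control dependence becomes a data
   dependence, which a compiler can lower to a conditional move. */
int min_cmov(int a, int b) {
    return (a < b) ? a : b;
}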

Compiler speculation:
- Exceptions from speculative instructions must be ignored: simply return an undefined value for any exception that would otherwise cause termination; renaming is still needed, in software
- Poison bits: a poison bit is added to every register, and a bit is added to every instruction to indicate whether it is speculative; renaming is done in software
- Boosting: provides the renaming and buffering in hardware, much as Tomasulo's algorithm does; each instruction is recorded as boosted along with the predicted branch direction, and the results of boosted instructions are forwarded to and used by other boosted instructions

Hardware-based speculation = dynamic branch prediction + speculation + dynamic scheduling.
Advantages:
- Can disambiguate memory references dynamically
- Hardware branch prediction is superior to software prediction
- Exceptions remain completely precise
- Requires no compensation code
- Gives good performance across different implementations of an architecture, i.e., binary compatibility
Implementation: Tomasulo's algorithm + a reorder buffer.

- Issue (dispatch): issue if there is an empty reservation station and an empty slot in the reorder buffer
- Execute: monitor the CDB while waiting for the operands to be computed, then execute
- Write result: write the result onto the CDB and into the reorder buffer
- Commit: when the instruction reaches the head of the reorder buffer, update the register with its result; if a mispredicted branch reaches the head, the reorder buffer is flushed and execution restarts at the correct successor of the branch (see the sketch below)
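A minimal C sketch of in-order commit from the reorder buffer (field names and sizes illustrative):

#include <stdbool.h>

typedef struct {
    bool ready;        /* result has been written      */
    bool is_branch;
    bool mispredicted; /* set when the branch resolves */
    int  dest;         /* architectural register (< 32) */
    long value;
} ROBEntry;

#define ROB_SIZE 64
static ROBEntry rob[ROB_SIZE];
static int head = 0, tail = 0, count = 0;
static long regs[32];

/* Commit in order from the head. On a mispredicted branch,
   flush everything younger and redirect fetch (not shown). */
void commit_step(void) {
    if (count == 0 || !rob[head].ready) return;  /* head not done yet */
    if (rob[head].is_branch && rob[head].mispredicted) {
        head = tail = count = 0;                 /* flush the ROB */
        /* restart fetch at the correct successor of the branch */
        return;
    }
    regs[rob[head].dest] = rob[head].value;      /* update arch state */
    head = (head + 1) % ROB_SIZE;
    count--;
}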

ILP-limit studies start from an ideal (perfect) processor:
- Perfect register renaming
- Perfect branch prediction
- Perfect jump prediction
- Perfect memory-address alias analysis
- Unlimited issue: enough functional units to let every ready instruction issue, looking arbitrarily far ahead to find instructions to issue
- One-cycle execution

The ideal assumptions are then relaxed step by step:
- Limited window size: a 2K-entry window with issue of up to 64 instructions
- Realistic branch and jump prediction: an aggressive predictor with 8K entries plus 2K-entry jump and return predictors
- Finite renaming registers: 256 renaming registers
- Imperfect alias analysis

2K-entry window, issue of up to 64 instructions

2K-entry window, issue of up to 64 instructions, aggressive predictor with 8K entries plus 2K-entry jump and return predictors

2K-entry window, issue of up to 64 instructions, aggressive predictor with 8K entries plus 2K-entry jump and return predictors, 256 renaming registers

A realizable but aggressive configuration:
- Issue of up to 64 instructions
- A selective predictor with 1K entries and a 16-entry return predictor
- Perfect disambiguation of memory references, done dynamically
- Register renaming with 64 additional integer and 64 additional FP registers
- No cache misses, unit latencies

The PowerPC 620:
- 64-bit advanced superscalar processor
- Can fetch, issue, and complete up to 4 instructions per cycle
- Speculative execution past up to 4 unresolved branches
- Register renaming: 8 extra integer registers and 8 extra FP registers
- Reservation stations with a 16-entry reorder buffer
- Six execution units:
  - 3 integer units with 2 reservation stations each: two simple integer units (XSU0, XSU1) and one complex integer function unit (XCFXU)
  - 1 branch unit (BPU) with 2 reservation stations
  - 1 load/store unit (LSU) with 3 reservation stations
  - 1 FP unit (FPU)

- 8-entry instruction queue
- Static/dynamic branch prediction, performed in the fetch and dispatch stages
- 256-entry 2-way set-associative branch target address cache
- 2048-entry branch history table
- Caches: 32KB 8-way set-associative non-blocking data cache, and a 32KB 8-way set-associative instruction cache
- Bus interface: 40-bit address bus and 128-bit data bus, split transactions, pipelined snooping bus protocol, MESI coherence

Fetch losses in the 620:
- Branch misprediction, despite a 256-entry 2-way set-associative BTB, a 2K-entry branch prediction buffer, and a return stack
- Instruction cache misses: not serious here because a perfect off-chip cache is assumed; partial cache-line fill

Issue stalls in the 620:
- No reservation station available
- No rename registers available
- Reorder buffer full
- Contention for the same functional unit
- Miscellaneous: shortage of read ports, serialization on special registers

Execution stalls and overall conclusions:
- Source operand unavailable: ILP is insufficient, or there are too few buffers
- Functional unit unavailable: increase the number of FUs, or add pipelining to the unpipelined units
- Out-of-order execution disallowed: serialization
Overall, performance is limited by:
1. Functional-unit limitations, especially the LSU
2. Losses in fetch, issue, and execution
3. ILP limitations and finite buffering
