CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

1 CMCS Advanced Computer Architecture, Lecture 6: Introduction to Pipelining. September 23, Mohamed Younis. CMCS 611, Advanced Computer Architecture

2 Lecture's Overview
Previous lecture:
- Type and size of operands (common data types, effect of operand size on design complexity)
- Encoding the instruction set (fixed, variable and hybrid encoding, the stored-program concept)
- The role of the compiler (compilation process, compiler optimization, linking and loading)
- Effect of ISA on compiler complexity (regularity; primitives, not solutions; simplify trade-offs; static binding)
This lecture:
- An overview of pipelining
- Pipeline performance
- Pipeline hazards

3 Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, one after another]
- Washer takes 30 min, dryer takes 40 min, folding takes 20 min
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
* Slide is courtesy of Dave Patterson

4 Pipelined Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped]
- Pipelining means start work as soon as possible
- Pipelined laundry takes 3.5 hours for 4 loads
* Slide is courtesy of Dave Patterson
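The two laundry totals can be checked with a short calculation. This is a minimal sketch, assuming every load takes the same stage times (30, 40 and 20 minutes, as given on the sequential laundry slide) and that a finished load can wait for the next machine; the function names are illustrative only.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """The first load passes through every stage; each later load
    completes one slowest-stage interval after the previous one."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, folding (minutes)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The 40-minute dryer is the slowest stage, so it sets the interval between finished loads.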

5 Pipelining Lessons
[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D in task order]
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Stall for dependencies
* Slide is courtesy of Dave Patterson

6 Basics of a RISC Instruction Set
RISC architectures are characterized by the following features, which dramatically simplify the implementation:
1. All operations apply only to data in registers
2. Memory is affected only by load and store operations
3. Instructions follow very few formats and typically are of the same size
All MIPS instructions are 32 bits, following one of three formats:
R-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
J-type:  op (6 bits) | target address (26 bits)
* Slide is courtesy of Dave Patterson
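The R-type layout can be illustrated by packing the six fields into one 32-bit word. This is a sketch only: the helper name is hypothetical, and the register numbers ($t1 = 9, $t2 = 10) and the opcode/funct values for add (0 and 0x20) come from the standard MIPS convention, not from the slide.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    fields = [(op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


# add $t2, $t1, $t1  (assumed numbering: $t1 = 9, $t2 = 10, op = 0, funct = 0x20)
word = encode_r_type(op=0, rs=9, rt=9, rd=10, shamt=0, funct=0x20)
print(f"{word:#010x}")  # 0x01295020
```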

7 MIPS Instruction Format
Register-format instructions:
op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
- op: basic operation of the instruction, traditionally called the opcode
- rs: the first register source operand
- rt: the second register source operand
- rd: the register destination operand; it gets the result of the operation
- shamt: shift amount (explained in future lectures)
- funct: selects the specific variant of the operation in the op field
MIPS assembly language includes two conditional branching instructions using PC-relative addressing:
    beq register1, register2, L1   # go to L1 if (register1) = (register2)
    bne register1, register2, L1   # go to L1 if (register1) ≠ (register2)
Examples:
    add $t2, $t1, $t1     # Temp reg $t2 = 2 × $t1
    sub $t1, $s3, $s4     # Temp reg $t1 = $s3 - $s4
    and $t1, $t2, $t3     # Temp reg $t1 = $t2 AND $t3
    bne $s3, $s4, Else    # if $s3 ≠ $s4 jump to Else

8 MIPS Instruction Format
Immediate-type instructions:
op (6 bits) | rs (5 bits) | rt (5 bits) | address (16 bits)
- The 16-bit address means a load word instruction can load a word within a region of ±2^15 bytes of the address in the base register
  Examples: lw $t0, 32($s3) and sw $t1, 128($s3)
- MIPS handles 16-bit constants efficiently by including the constant value in the address field of an I-type (immediate-type) instruction
    addi $sp, $sp, 4    # $sp = $sp + 4
- For large constants that need more than 16 bits, a load upper-immediate (lui) instruction is used to concatenate the second part
  [Figure: a lui $t0 instruction and the contents of $t0 after execution]
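How a constant wider than 16 bits is assembled from two halves can be sketched as follows. The slide names only lui; pairing it with ori for the lower half is the usual MIPS idiom and is an assumption here, as are the helper names and the example constant 4,000,000.

```python
def split_constant(value):
    """Split a 32-bit constant into its upper and lower 16-bit halves."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF


def lui(upper16):
    """Model of lui: place a 16-bit value in the upper half, zero the lower half."""
    return (upper16 & 0xFFFF) << 16


def ori(reg_value, lower16):
    """Model of ori (assumed companion instruction): OR in a zero-extended 16-bit value."""
    return reg_value | (lower16 & 0xFFFF)


upper, lower = split_constant(4_000_000)
t0 = ori(lui(upper), lower)
print(upper, lower, t0)  # 61 2304 4000000
```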

9 Addressing in Branches & Jumps
- I-type instructions leave only 16 bits for the address reference, limiting the size of the jump
- MIPS branch instructions use the address as an increment to the PC, allowing the program to be as large as 2^32 (called PC-relative addressing)
- Since the program counter gets incremented prior to instruction execution, the branch address is actually relative to (PC + 4)
- MIPS also supports a J-type instruction format for large jump instructions:
  op (6 bits) | address (26 bits)
- The 26-bit address in a J-type instruction is a word address: it is shifted left by 2 bits and concatenated to the upper 4 bits of the PC
    Loop: add $t1, $s3, $s3
          add $t1, $t1, $t1
          add $t1, $t1, $s6
          lw  $t0, 0($t1)
          bne $t0, $s5, Exit
          add $s3, $s3, $s4
          j   Loop
    Exit:
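The two addressing modes can be sketched as small address calculations. This sketch follows the standard MIPS definition, which is an assumption beyond the slide text: both fields count words, so they are shifted left by 2, and a jump keeps the upper 4 bits of PC + 4. The function names and example addresses are illustrative.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, imm16):
    """PC-relative: the offset is in words and is relative to PC + 4."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, addr26):
    """J-type: upper 4 bits of PC + 4, then the 26-bit word address shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)


print(hex(branch_target(0x00400010, 3)))        # 0x400020
print(hex(branch_target(0x00400010, 0xFFFF)))   # 0x400010 (offset of -1 word)
print(hex(jump_target(0x00400010, 0x0100000)))  # 0x400000
```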

10 A Simple Implementation of MIPS

11 Single-cycle Instruction Execution

12 Multi-Cycle Implementation of MIPS
Instruction fetch cycle (IF):
    IR ← Mem[PC]; NPC ← PC + 4
Instruction decode/register fetch cycle (ID):
    A ← Regs[IR[6..10]]; B ← Regs[IR[11..15]]; Imm ← ((IR[16])^16 ## IR[16..31])
Execution/effective address cycle (EX):
    Memory ref: Output ← A + Imm
    Reg-Reg:    Output ← A func B
    Reg-Imm:    Output ← A op Imm
    Branch:     Output ← NPC + Imm; Cond ← (A op 0)
Memory access/branch completion cycle (MEM):
    Memory ref: LMD ← Mem[Output] or Mem[Output] ← B
    Branch:     if (cond) PC ← Output
Write-back cycle (WB):
    Reg-Reg: Regs[IR[16..20]] ← Output
    Reg-Imm: Regs[IR[11..15]] ← Output
    Load:    Regs[IR[11..15]] ← LMD
Bit positions count from the most significant bit (bit 0), so IR[6..10] is rs, IR[11..15] is rt, IR[16..20] is rd and IR[16..31] is the immediate.
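The ID and EX steps for a memory reference can be sketched in a few lines: pull the fields out of IR, sign-extend the immediate, and add it to the base register. The function names are hypothetical, and the example word 0x8E680020 is the standard MIPS encoding of lw $t0, 32($s3) (opcode 35, $s3 = 19, $t0 = 8), which is an assumption not stated on the slide.

```python
def sign_extend16(value):
    """Replicate bit 15 of the immediate into the upper bits (two's complement)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode_i_type(ir):
    """ID step: split a 32-bit I-type instruction word into its fields."""
    return {
        "op": (ir >> 26) & 0x3F,
        "rs": (ir >> 21) & 0x1F,
        "rt": (ir >> 16) & 0x1F,
        "imm": sign_extend16(ir),
    }


def effective_address(ir, regs):
    """EX step for a memory reference: Output <- A + Imm."""
    fields = decode_i_type(ir)
    a = regs[fields["rs"]]  # A <- Regs[rs]
    return (a + fields["imm"]) & 0xFFFFFFFF


regs = [0] * 32
regs[19] = 1000                             # assumed contents of $s3
print(decode_i_type(0x8E680020))            # op 35, rs 19, rt 8, imm 32
print(effective_address(0x8E680020, regs))  # 1032
```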

13 Multi-cycle Instruction Execution

14 Stages of Instruction Execution
Load:  Ifetch (Cycle 1) | Reg/Dec (Cycle 2) | Exec (Cycle 3) | Mem (Cycle 4) | WB (Cycle 5)
- The load instruction is the longest
- All instructions follow at most the following five steps:
  - Ifetch: instruction fetch; fetch the instruction from the instruction memory and update the PC
  - Reg/Dec: register fetch and instruction decode
  - Exec: calculate the memory address
  - Mem: read the data from the data memory
  - WB: write the data back to the register file
* Slide is courtesy of Dave Patterson

15 Instruction Pipelining
- Start handling of the next instruction while the current instruction is in progress
- Pipelining is feasible when different devices are used at different stages of instruction execution
[Figure: program flow versus time; six instructions each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous one]
$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}$$
- Pipelining improves performance by increasing instruction throughput
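The overlap in the timing diagram can be reproduced with a small schedule generator. This is a sketch assuming an ideal five-stage pipeline with no stalls, where a new instruction enters every cycle; the stage names follow the slide, and the function names are illustrative.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def schedule(num_instructions):
    """For each instruction, map every stage to the (1-based) clock cycle it occupies.
    Instruction i enters the pipeline one cycle after instruction i - 1."""
    return [
        {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    ]


def total_cycles(num_instructions):
    """Cycles until the last instruction leaves WB (fill time plus one per instruction)."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1


for index, row in enumerate(schedule(3)):
    print(index, row)
print(total_cycles(6))  # 10
```

In any one cycle no two instructions occupy the same stage, which is why each stage needs its own hardware.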

16 Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock-cycle timelines for the three implementations]
- Single cycle implementation: Load takes one long cycle (Cycle 1); Store takes the next (Cycle 2), part of which is wasted
- Multiple cycle implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch, ...) across Cycles 1 to 10
- Pipeline implementation: Load, Store and R-type each go through Ifetch, Reg, Exec, Mem, Wr, overlapped one cycle apart
* Slide is courtesy of Dave Patterson

17 Example of Instruction Pipelining
[Figure: program execution order versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each drawn as instruction fetch, Reg, data access, Reg]
- Nonpipelined: a new instruction starts every 8 ns; time between first and fourth instructions is 3 × 8 = 24 ns
- Pipelined: a new instruction starts every 2 ns; time between first and fourth instructions is 3 × 2 = 6 ns
- Ideal and upper bound for speedup is the number of stages in the pipeline

18 Pipeline Performance
- Pipelining increases the instruction throughput but does not reduce the execution time of an individual instruction
- Execution time of an individual instruction in a pipeline can be slower due to:
  - Additional pipeline control compared to non-pipelined execution
  - Imbalance among the different pipeline stages
- Suppose we execute 100 instructions:
  - Single-cycle machine: 45 ns/cycle × 1 CPI × 100 inst = 4500 ns
  - Multi-cycle machine: 10 ns/cycle × 4.2 CPI (due to inst mix) × 100 inst = 4200 ns
  - Ideal 5-stage pipelined machine: 10 ns/cycle × (1 CPI × 100 inst + 4 cycle drain) = 1040 ns
- Due to fill and drain effects of a pipeline, ideal performance can be achieved only for a very large number of instructions
  Example: a sequence of 1000 load instructions would take 5000 cycles on a multi-cycle machine while taking 1004 on a pipelined machine, so speedup = 5000/1004 ≈ 4.98
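The three execution times and the 1000-load speedup can be reproduced directly. This is a sketch using the cycle times and CPI values quoted on the slide as default parameters; the function names are illustrative.

```python
def single_cycle_ns(instructions, cycle_ns=45):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * instructions


def multi_cycle_ns(instructions, cycle_ns=10, cpi=4.2):
    """Short cycles, several per instruction (average CPI from the instruction mix)."""
    return cycle_ns * cpi * instructions


def pipelined_ns(instructions, cycle_ns=10, stages=5):
    """One instruction completes per cycle, plus (stages - 1) cycles of fill/drain."""
    return cycle_ns * (instructions + stages - 1)


print(single_cycle_ns(100))       # 4500
print(round(multi_cycle_ns(100))) # 4200
print(pipelined_ns(100))          # 1040

# 1000 loads: 5 cycles each on the multi-cycle machine vs. 1000 + 4 cycles pipelined
speedup = (1000 * 5) / (1000 + 4)
print(f"{speedup:.2f}")           # 4.98
```

As the instruction count grows, the four fill/drain cycles matter less and the speedup approaches the pipeline depth of 5.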

19 Pipeline Datapath
[Figure: pipelined datapath, labeled "Data Stationary"]
- Every stage must be completed in one clock cycle to avoid stalls
- Values must be latched to ensure correct execution of instructions
- The PC multiplexer has moved to the IF stage to prevent two instructions from updating the PC simultaneously (in case of a branch instruction)

20 Pipeline Stage Interface
IF (any instruction):
    IF/ID.IR ← Mem[PC];
    IF/ID.NPC, PC ← (if ((EX/MEM.opcode == branch) & EX/MEM.cond) {EX/MEM.Output} else {PC + 4});
ID (any instruction):
    ID/EX.A ← Regs[IF/ID.IR[6..10]]; ID/EX.B ← Regs[IF/ID.IR[11..15]];
    ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
    ID/EX.Imm ← (IF/ID.IR[16])^16 ## IF/ID.IR[16..31];
EX:
    ALU instruction: EX/MEM.IR ← ID/EX.IR;
                     EX/MEM.Output ← ID/EX.A func ID/EX.B; or EX/MEM.Output ← ID/EX.A op ID/EX.Imm;
                     EX/MEM.cond ← 0;
    Load or store:   EX/MEM.IR ← ID/EX.IR;
                     EX/MEM.Output ← ID/EX.A + ID/EX.Imm;
                     EX/MEM.cond ← 0; EX/MEM.B ← ID/EX.B;
    Branch:          EX/MEM.Output ← ID/EX.NPC + ID/EX.Imm;
                     EX/MEM.cond ← (ID/EX.A op 0);
MEM:
    ALU instruction: MEM/WB.IR ← EX/MEM.IR; MEM/WB.Output ← EX/MEM.Output;
    Load or store:   MEM/WB.IR ← EX/MEM.IR;
                     MEM/WB.LMD ← Mem[EX/MEM.Output]; or Mem[EX/MEM.Output] ← EX/MEM.B;
WB:
    ALU instruction: Regs[MEM/WB.IR[16..20]] ← MEM/WB.Output; or Regs[MEM/WB.IR[11..15]] ← MEM/WB.Output;
    For load only:   Regs[MEM/WB.IR[11..15]] ← MEM/WB.LMD;
Bit positions count from the most significant bit (bit 0), so IR[6..10] is rs, IR[11..15] is rt, IR[16..20] is rd and IR[16..31] is the immediate.

21 Pipeline Hazards
- Pipeline hazards are cases that affect instruction execution semantics and thus need to be detected and corrected
- Hazard types:
  - Structural hazard: attempt to use a resource in two different ways at the same time
    E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV)
    In a processor: a single memory for instructions and data
  - Data hazard: attempt to use an item before it is ready
    E.g., one sock of a pair in the dryer and one in the washer; can't fold until the sock gets from the washer through the dryer
    In a processor: an instruction depends on the result of a prior instruction still in the pipeline
  - Control hazard: attempt to make a decision before the condition is evaluated
    E.g., washing football uniforms and need to get the proper detergent level; need to see after the dryer before the next load
    In a processor: branch instructions
- Hazards can always be resolved by waiting
* Slide is courtesy of Dave Patterson

22 Single Memory is a Structural Hazard
[Figure: instruction order versus time (clock cycles) for Load, Inst 1, Inst 2, Inst 3, Inst 4, each using Mem, Reg, ALU, Mem, Reg; a data access and an instruction fetch need the single memory in the same cycle]
- Can be easily detected
- Resolved by inserting idle cycles
* Slide is courtesy of Dave Patterson

23 Stalls & Pipeline Performance
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$
Ideally the CPI of the pipelined execution is 1 (after fill-up), thus
$$\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \text{Pipeline stall clock cycles per instruction}$$
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$
Assuming all pipeline stages are balanced and taking CPI unpipelined = 1, the clock cycle ratio equals the pipeline depth, so
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$
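A quick numeric check of the final formula shows how quickly stalls erode the ideal speedup. This is a sketch assuming balanced stages; the function name and the stall rates are illustrative values, not figures from the lecture.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return depth / (1 + stall_cycles_per_instruction)


print(pipeline_speedup(5, 0))     # 5.0  (ideal: equals the pipeline depth)
print(pipeline_speedup(5, 0.25))  # 4.0
print(pipeline_speedup(5, 1))     # 2.5
```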

24 Data Hazard
[Figure: instruction order versus time (clock cycles) through stages IF, ID/RF, EX, MEM, WB]
    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11
- Dependencies backwards in time are hazards
* Slide is courtesy of Dave Patterson

25 Data Hazard Solution
[Figure: instruction order versus time (clock cycles) through stages IF, ID/RF, EX, MEM, WB, with results forwarded between stages]
    add r1, r2, r3
    sub r4, r1, r3
    and r6, r1, r7
    or  r8, r1, r9
    xor r10, r1, r11
- Forward result from one stage to another
* Slide is courtesy of Dave Patterson

26 Resolving Data Hazards for Loads
[Figure: time (clock cycles) through stages IF, ID/RF, EX, MEM, WB]
    lw  r1, 0(r2)
    sub r4, r1, r3
- Dependencies backwards in time are hazards
- Cannot solve with forwarding
- Must delay/stall the instruction dependent on the load
* Slide is courtesy of Dave Patterson
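Detecting the load case can be sketched as a scan over adjacent instruction pairs. This sketch assumes the classic five-stage pipeline with full forwarding, where only an instruction that immediately follows a load and reads its destination must stall, and then for one cycle; the stall count, the tuple representation and the function name are assumptions for illustration.

```python
def load_use_stalls(program):
    """Count stall cycles, assuming full forwarding: one stall whenever an
    instruction reads the destination of the load directly before it.

    Each instruction is a tuple (opcode, destination, sources)."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, sources = current
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls


load_case = [
    ("lw", "r1", ("r2",)),
    ("sub", "r4", ("r1", "r3")),
]
alu_case = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
]
print(load_use_stalls(load_case))  # 1
print(load_use_stalls(alu_case))   # 0 (forwarding covers ALU results)
```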

27 Control Hazard
- Stall: wait until the decision is clear
- It is possible to move the decision up to the 2nd stage by adding hardware to check registers as they are being read
[Figure: Add, Beq, Load in instruction order versus time (clock cycles); Load is stalled until the branch decision is known]
- Impact: 2 clock cycles per branch instruction, which is slow
* Slide is courtesy of Dave Patterson

28 Control Hazard Solution
- Predict: guess one direction, then back up if wrong
- Predict not taken
[Figure: Add, Beq, Load in instruction order versus time (clock cycles)]
- Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right about 50% of the time)
- More dynamic scheme: history of 1 branch (right about 90% of the time)
* Slide is courtesy of Dave Patterson
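The average cost of a branch under prediction follows from weighting the two outcomes by the prediction accuracy. This is a sketch using the slide's costs (1 cycle if right, 2 if wrong) and its approximate accuracies; the function name is illustrative.

```python
def average_branch_cycles(accuracy, cycles_if_right=1, cycles_if_wrong=2):
    """Expected clock cycles per branch for a given prediction accuracy (0 to 1)."""
    return accuracy * cycles_if_right + (1 - accuracy) * cycles_if_wrong


print(f"{average_branch_cycles(0.5):.2f}")  # 1.50 (predict not taken)
print(f"{average_branch_cycles(0.9):.2f}")  # 1.10 (1-branch history)
```

Both figures improve on the 2 cycles per branch paid when every branch stalls.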

29 Control Hazard Solution
- Redefine branch behavior (takes place after the next instruction): delayed branch
[Figure: Add, Beq, Misc, Load in instruction order versus time (clock cycles)]
- Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the slot (about 50% of the time)
* Slide is courtesy of Dave Patterson

30 Conclusion
Summary:
- An overview of pipelining
  - Pipelining concept is natural
  - Start handling of the next instruction while the current one is in progress
- Pipeline performance
  - Performance improvement by increasing instruction throughput
  - Ideal and upper bound for speedup is the number of stages in the pipeline
- Pipeline hazards
  - Structural, data and control hazards
  - Hazard resolution techniques
Next lecture:
- Data and control hazards
- Pipelined control
Reading assignment includes Appendix A.1 & A.2 in the textbook