CS 61C: Great Ideas in Computer Architecture Instruc(on Level Parallelism: Mul(ple Instruc(on Issue

Size: px

Start display at page:

Download "CS 61C: Great Ideas in Computer Architecture Instruc(on Level Parallelism: Mul(ple Instruc(on Issue"

Raymond Gyles Strickland
5 years ago
Views:

CS 61C: Geat Ideas in Compute Achitectue Instuc(on Level Paallelism: Mul(ple Instuc(on Issue Instuctos: Kste Asanovic, Randy H. Katz hbp://inst.eecs.bekeley.

g., 5 pipelined instucvons Paallel Data >1 data item @ one Vme e.g., Add of 4 pais of wods Hadwae descipvons All gates @ one Vme Pogamming Languages You Ae Hee!

Compute (Cache) Coe FuncVonal Unit(s) A 0 +B 0 A 1 +B 1 A 2 +B 2 A 3 +B 3 The BIG Pictue Review Pipelining impoves pefomance by inceasing instucvon thoughput: exploits

esults Compile can aange code to avoid hazads and stalls Requies knowledge of the pipeline stuctue Agenda Moe ILP InstucVon Scheduling Administivia Out of Ode ExecuVon

InstucVon- Level Paallelism (ILP) Deepe pipeline (5 => 10 => 15 stages) Less wok pe stage shote clock cycle MulVple issue supescala Replicate pipeline stages mulvple

1 CS 61C: Geat Ideas in Compute Achitectue Instuc(on Level Paallelism: Mul(ple Instuc(on Issue Instuctos: Kste Asanovic, Randy H. Katz hbp://inst.eecs.bekeley.edu/~cs61c/fa12 1 Paallel Requests Assigned to compute e.g., Seach Katz Paallel Theads Assigned to coe e.g., Lookup, Ads So7wae Paallel InstucVons >1 one Vme e.g., 5 pipelined instucvons Paallel Data >1 data one Vme e.g., Add of 4 pais of wods Hadwae descipvons All one Vme Pogamming Languages You Ae Hee! Haness Paallelism & Achieve High Pefomance Hadwae Today s Lectue Waehouse Scale Compute Coe Memoy Input/Output InstucVon Unit(s) Main Memoy Coe Smat Phone Logic Gates 2 Compute (Cache) Coe FuncVonal Unit(s) A 0 +B 0 A 1 +B 1 A 2 +B 2 A 3 +B 3 The BIG Pictue Review Pipelining impoves pefomance by inceasing instucvon thoughput: exploits ILP Executes mulvple instucvons in paallel Each instucvon has the same latency Subject to hazads Stuctue, data, contol Stalls educe pefomance But ae equied to get coect esults Compile can aange code to avoid hazads and stalls Requies knowledge of the pipeline stuctue Agenda Moe ILP InstucVon Scheduling Administivia Out of Ode ExecuVon Paallelism Big Pictue And, in Conclusion, 3 4 Agenda Moe ILP InstucVon Scheduling Administivia Out of Ode ExecuVon Paallelism Big Pictue And, in Conclusion, Geate InstucVon- Level Paallelism (ILP) Deepe pipeline (5 => 10 => 15 stages) Less wok pe stage shote clock cycle MulVple issue supescala Replicate pipeline stages mulvple pipelines Stat mulvple instucvons pe clock cycle CPI < 1, so use InstucVons Pe Cycle (IPC) E.g., 4GHz 4- way mulvple- issue 16 BIPS, peak CPI = 0.25, peak IPC = 4 But dependencies educe this in pacvce 5 6 1

2 MulVple Issue StaVc mulvple issue Compile goups instucvons to be issued togethe Packages them into issue slots Compile detects and avoids hazads Dynamic mulvple issue CPU examines instucvon steam and chooses instucvons to issue each cycle Compile can help by eodeing instucvons CPU esolves hazads using advanced techniques at unvme Supescala Laundy: Paallel pe stage T a s k 6 PM AM A B C Time (light clothing) (dak clothing) (vey dity clothing) O D (light clothing) d E (dak clothing) e F (vey dity clothing) Moe esouces, HW to match mix of paallel tasks? 7 8 Pipeline Depth and Issue Width Intel Pocessos ove Time Micopocesso Yea Clock Rate Pipeline Stages Issue width Coes Powe i MHz W Pentium MHz W Pentium Po MHz W P4 Willamette MHz W P4 Pescott MHz W Coe 2 Conoe MHz W Coe 2 Yokfield MHz W Coe i7 Gulftown MHz W Pipeline Depth and Issue Width Clock 1000 Powe 100 Pipeline Stages Issue width 10 Coes StaVc MulVple Issue Compile goups instucvons into issue packets Goup of instucvons that can be issued on a single cycle Detemined by pipeline esouces equied Think of an issue packet as a vey long instucvon Specifies mulvple concuent opeavons Scheduling StaVc MulVple Issue Compile must emove some/all hazads Reode instucvons into issue packets No dependencies with a packet Possibly some dependencies between packets Vaies between ISAs; compile must know! Pad with nop if necessay

3 MIPS with StaVc Dual Issue Two- issue packets One ALU/banch instucvon One load/stoe instucvon 64- bit aligned ALU/banch, then load/stoe Pad an unused instucvon with nop Addess Instuction type Pipeline Stages n ALU/banch IF ID EX MEM WB n + 4 Load/stoe IF ID EX MEM WB n + 8 ALU/banch IF ID EX MEM WB n + 12 Load/stoe IF ID EX MEM WB n + 16 ALU/banch IF ID EX MEM WB n + 20 Load/stoe IF ID EX MEM WB Hazads in the Dual- Issue MIPS Moe instucvons execuvng in paallel EX data hazad Fowading avoided stalls with single- issue Now can t use ALU esult in load/stoe in same packet add $t0, $s0, $s1 load $s2, 0($t0) Split into two packets, effecvvely a stall Load- use hazad SVll one cycle use latency, but now two instucvons Moe aggessive scheduling equied Scheduling Example Schedule this fo dual- issue MIPS Loop: lw $t0, 0($s1) # $t0=aay element addu $t0, $t0, $s2 # add scala in $s2 sw $t0, 0($s1) # stoe esult addi $s1, $s1, 4 # decement pointe bne $s1, $zeo, Loop # banch $s1!=0 ALU/banch Load/stoe cycle Loop: nop lw $t0, 0($s1) 1 addi $s1, $s1, 4 nop 2 addu $t0, $t0, $s2 nop 3 bne $s1, $zeo, Loop sw $t0, 4($s1) 4 n IPC = 5/4 = 1.25 (c.f. peak IPC = 2) Loop Unolling Replicate loop body to expose moe paallelism Reduces loop- contol ovehead Use diffeent egistes pe eplicavon Called egiste enaming Avoid loop- caied anv- dependencies Stoe followed by a load of the same egiste Aka name dependence Reuse of a egiste name Loop Unolling Example ALU/banch Load/stoe cycle Loop: addi $s1, $s1, 16 lw $t0, 0($s1) 1 nop lw $t1, 12($s1) 2 addu $t0, $t0, $s2 lw $t2, 8($s1) 3 addu $t1, $t1, $s2 lw $t3, 4($s1) 4 addu $t2, $t2, $s2 sw $t0, 16($s1) 5 addu $t3, $t4, $s2 sw $t1, 12($s1) 6 nop sw $t2, 8($s1) 7 bne $s1, $zeo, Loop sw $t3, 4($s1) 8 IPC = 14/8 = 1.75 Close to 2, but at cost of egistes and code size Dynamic MulVple Issue Supescala pocessos CPU decides whethe to issue 0, 1, 2, each cycle Avoiding stuctual and data hazads Avoids the need fo compile scheduling Though it may svll help Code semanvcs ensued by the CPU

4 Dynamic Pipeline Scheduling Allow the CPU to execute instucvons out of ode to avoid stalls But commit esult to egistes in ode Example lw $t0, 20($s2) addu $t1, $t0, $t2 subu $s4, $s4, $t3 slti $t5, $s4, 20 Can stat subu while addu is waivng fo lw Why Do Dynamic Scheduling? Why not just let the compile schedule code? Not all stalls ae pedicable e.g., cache misses Can t always schedule aound banches Banch outcome is dynamically detemined Diffeent implementavons of an ISA have diffeent latencies and hazads Agenda Moe ILP InstucVon Scheduling Administivia Out of Ode ExecuVon Paallelism Big Pictue And, in Conclusion, 21 Administivia As of today, made 1 pass ove all Big Ideas in Compute Achitectue Following lectues go into moe depth on topics you ve aleady seen while you wok on pojects 1 lectue in moe depth on Caches 1 on C stoage management 1 on Dependability/Reliability/Redundancy 1 on PotecVon, Vitual Memoy, Vitual Machines 1 on ExcepVons, Taps, Inteupts 2 on Moden Phone and Compute Achitectues 1 on Pogamming Contest (Exta Cedit Poject 5) 1 on Couse Wap- up and Review 22 Agenda Moe ILP InstucVon Scheduling Administivia Out of Ode ExecuVon Paallelism Big Pictue And, in Conclusion, SpeculaVon Guess what to do with an instucvon Stat opeavon as soon as possible Check whethe guess was ight If so, complete the opeavon If not, oll- back and do the ight thing Common to stavc and dynamic mulvple issue Examples Speculate on banch outcome (Banch PedicVon) Roll back if path taken is diffeent Speculate on load Roll back if locavon is updated

Pipeline Hazad: Matching socks in late load T a s k O d e 6 PM 7 8 9 10 11 12 1 2 AM A B C D E F 30 30 30 30 30 30 30 bubble A depends on D; stall since folde Ved up; Time 25 Out- of- Ode Laundy: Don

Basically, unoll loops in hadwae 1. Fetch instucvons in pogam ode ( 4/clock) 2. Pedict banches as taken/untaken 3.

5 Pipeline Hazad: Matching socks in late load T a s k O d e 6 PM AM A B C D E F bubble A depends on D; stall since folde Ved up; Time 25 Out- of- Ode Laundy: Don t Wait 6 PM AM T Time a s k A B C bubble O d e D E F A depends on D; est convnue; need moe esouces to allow out- of- ode 26 Out- of- Ode ExecuVon (1/2) Basically, unoll loops in hadwae 1. Fetch instucvons in pogam ode ( 4/clock) 2. Pedict banches as taken/untaken 3. To avoid hazads on egistes, ename egistes using a set of intenal egistes (~80 egistes) 4. CollecVon of enamed instucvons might execute in a window (~60 instucvons) 5. Execute instucvons with eady opeands in 1 of mulvple func(onal units (ALUs, FPUs, Ld/St) 27 Out- of- Ode ExecuVon (2/2) Basically, unoll loops in hadwae 6. Buffe esults of executed instucvons unvl pedicted banches ae esolved in eode buffe 7. If pedicted banch coectly, commit esults in pogam ode 8. If pedicted banch incoectly, discad all dependent esults and stat with coect PC 28 Out Of Ode Intel All use OOO since 2001 AMD Opteon X4 Micoachitectue 72 physical egistes Micopocesso Yea Clock Rate Pipeline Stages Issue width Out-of-ode/ Speculation Coes i MHz 5 1 No 1 5W Powe Pentium MHz 5 2 No 1 10W Pentium Po MHz 10 3 Yes 1 29W P4 Willamette MHz 22 3 Yes 1 75W P4 Pescott MHz 31 3 Yes 1 103W Coe MHz 14 4 Yes 2 75W Coe 2 Yokfield MHz 16 4 Yes 4 95W Coe i7 Gulftown MHz 16 4 Yes 6 130W Chapte 4 The Pocesso

AMD Opteon X4 Pipeline Flow Fo intege opeavons Dynamically Scheduled CPU Peseves dependencies Hold pending opeands n n 12 stages (FloaVng Point is 17 stages) Up to 106 RISC- ops in pogess n Intel

instucvons Results also sent to any waivng esevavon stavons 31 32 Does MulVple Issue Wok?

ms have eal dependencies that limit ILP Some dependencies ae had to eliminate e.g.

on Paallelism Two types of paallelism in applica(ons 1. Data- Level Paallelism (DLP): aises because thee ae many data items that can be opeated on at the same Vme 2.

6 AMD Opteon X4 Pipeline Flow Fo intege opeavons Dynamically Scheduled CPU Peseves dependencies Hold pending opeands n n 12 stages (FloaVng Point is 17 stages) Up to 106 RISC- ops in pogess n Intel Nehalem is 16 stages fo intege opeavons, details not evealed, but likely simila to above+ n Intel calls RISC opeavons Mico opeavons o μops Reodes buffe fo egiste wites Can supply opeands fo issued instucvons Results also sent to any waivng esevavon stavons Does MulVple Issue Wok? The BIG Pictue Yes, but not as much as we d like Pogams have eal dependencies that limit ILP Some dependencies ae had to eliminate e.g., pointe aliasing Some paallelism is had to expose Limited window size duing instucvon issue Memoy delays and limited bandwidth Had to keep pipelines full SpeculaVon can help if done well Big Pictue on Paallelism Two types of paallelism in applica(ons 1. Data- Level Paallelism (DLP): aises because thee ae many data items that can be opeated on at the same Vme 2. Task- Level Paallelism (TLP): aises because tasks of wok ae ceated that can opeate lagely in paallel Fall Lectue # Big Pictue on Paallelism Hadwae can exploit app Data LP and Task LP in 4 ways: 1. Instuc(on- Level Paallelism: Hadwae exploits applicavon DLP using ideas like pipelining and speculavve execuvon 2. SIMD achitectues: exploit app DLP by applying a single instucvon to a collecvon of data in paallel 3. Thead- Level Paallelism: exploits eithe app DLP o TLP in a Vghtly- coupled hadwae model that allows fo inteacvon among paallel theads 4. Request- Level Paallelism: exploits paallelism among lagely decoupled tasks and is specified by the pogamme of the opeavng system And in Conclusion.. Pipeline challenge is hazads Fowading helps w/many data hazads Delayed banch helps with contol hazad in 5 stage pipeline Load delay slot / intelock necessay Moe aggessive pefomance: Longe pipelines Supescala Out- of- ode execuvon SpeculaVon

7 Pee QuesVon State if following techniques ae associated pimaily with a softwae- o hadwae-based appoach to exploiting ILP (in some cases, the answe may be both): Supescala, Out-of- Ode execution, Speculation, Registe Renaming Supe- scala Out of Ode Specu- lavon Registe Renaming Oange HW HW HW HW Geen HW HW Both Both Pink HW HW HW Both HW HW HW SW Pee QuesVon State if following techniques ae associated pimaily with a softwae- o hadwae-based appoach to exploiting ILP (in some cases, the answe may be both): Supescala, Out-of- Ode execution, Speculation, Registe Renaming Supe- scala Out of Ode Specu- lavon Registe Renaming Oange HW HW HW HW Geen HW HW Both Both Pink HW HW HW Both HW HW HW SW Pee InstucVon Inst LP, SIMD, Thead LP, Request LP ae examples of Paallelism above ( ) the InstucVon Set Achitectue Paallelism explicitly at (=) the level of the ISA Paallelism below ( ) the level of the ISA Inst. LP SIMD Th. LP Req. LP Oange = = Geen = = Pink = = Pee InstucVon Inst LP, SIMD, Thead LP, Request LP ae examples of Paallelism above ( ) the InstucVon Set Achitectue Paallelism explicitly at (=) the level of the ISA Paallelism below ( ) the level of the ISA Inst. LP SIMD Th. LP Req. LP Oange = = Geen = = Pink = = Pee InstucVon Pee Answe I. Thanks to pipelining, I have educed the time it took me to wash my one shit. II. Longe pipelines ae always a win (since less wok pe stage & a faste clock). I. Thanks to pipelining, I have educed the Vme it took me to wash my one shit. II. Longe pipelines ae always a win (since less wok pe stage & a faste clock). A)(oange) I is Tue and II is Tue B)(geen) I is False and II is Tue C)(pink) I is Tue and II is False Red Oange Geen I is Tue and II is Tue I is False and II is Tue I is Tue and II is False

8 Pee QuesVon Not all instucvons ae acvve in evey stage of the 5- stage pipeline. Ignoing the effects of hazads, which of the following is tue? 1. Allowing jumps, banches, and ALU instucvons to take fewe stages than the 5 equied by the load instucvon will incease pipeline pefomance fo most pogams. 2. You cannot make ALU instucvons take fewe cycles because of the wite back of the esult, but banches and jumps can take fewe cycles, so thee is some oppotunity fo impovement. 3. Instead of tying to make instucvons take fewe cycles, we should exploe making the pipeline longe, so that instucvons take moe cycles, but the cycles ae shote. This could impove pefomance. 4. The numbe of pipe stages pe instucvon affects thoughput, not latency. Oange: 1 Geen: 2 Pink: 3 Pee Answe Not all instucvons ae acvve in evey stage of the 5- stage pipeline. Ignoing the effects of hazads, which of the following is tue? 1. Allowing jumps, banches, and ALU instucvons to take fewe stages than the 5 equied by the load instucvon will incease pipeline pefomance fo most pogams. 2. You cannot make ALU instucvons take fewe cycles because of the wite back of the esult, but banches and jumps can take fewe cycles, so thee is some oppotunity fo impovement. 3. Instead of tying to make instucvons take fewe cycles, we should exploe making the pipeline longe, so that instucvons take moe cycles, but the cycles ae shote. This could impove pefomance. 4. The numbe of pipe stages pe instucvon affects thoughput, not latency. Oange: 1 Geen: 2 Pink: And in Conclusion, Big Ideas of InstucVon Level Paallelism Pipelining, Hazads, and Stalls Fowading, SpeculaVon to ovecome Hazads MulVple issue to incease pefomance IPC instead of CPI Dynamic ExecuVon: Supescala in- ode issue, banch pedicvon, egiste enaming, out- of- ode execuvon, in- ode commit unoll loops in HW, hide cache misses 45 8

You Are Here! Review: Hazards. Agenda. Agenda. Review: Load / Branch Delay Slots 7/28/2011

You Are Here! Review: Hazards. Agenda. Agenda. Review: Load / Branch Delay Slots 7/28/2011 CS 61C: Geat Ideas in Compute Achitectue (Machine Stuctues) Instuction Level Paallelism: Multiple Instuction Issue Guest Lectue: Justin Hsia Softwae Paallel Requests Assigned to compute e.g., Seach Katz