CS 61C: Great Ideas in Computer Architecture (Machine Structures) Instruction Level Parallelism

Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.berkeley.edu/~cs61c/fa10
11/5/10 Fall Lecture #29

Agenda
- Review
- Instruction Set Design and Pipelined Execution
- Control Hazards
- Administrivia
- Branch Prediction
- Higher Level ILP
- Summary

Review: The BIG Picture
- Pipelining improves performance by increasing instruction throughput: exploits ILP
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

Pipelining and ISA Design
- MIPS Instruction Set designed for pipelining
- All instructions are 32 bits
  - Easier to fetch and decode in one cycle
  - x86: 1- to 17-byte instructions (x86 HW actually translates to internal RISC instructions!)
- Few and regular instruction formats, 2 source register fields always in same place
  - Can decode and read registers in one step
- Memory operands only in Loads and Stores
  - Can calculate address in 3rd stage, access memory in 4th stage
- Alignment of memory operands
  - Memory access takes only one cycle

Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - Still working on ID stage of branch
- BEQ, BNE in MIPS pipeline
- Simple solution, Option 1: Stall on every branch until have new PC value
  - Would add 2 bubbles/clock cycles for every branch! (~20% of instructions executed)

Stall => 2 bubbles/clocks
[Figure: pipeline diagram of beq followed by Inst 1 to Inst 4, instruction order vs. time (clock cycles)]
- Where do we do the compare for the branch?

Control Hazard: Branching
- Optimization #1:
  - Insert special branch comparator in Stage 2
  - As soon as instruction is decoded (opcode identifies it as a branch), immediately make a decision and set the new value of the PC
  - Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
  - Side Note: This means that branches are idle in Stages 3, 4 and 5.

One Clock Cycle Stall
[Figure: pipeline diagram of beq followed by Inst 1 to Inst 4, instruction order vs. time (clock cycles), stages I$, Reg, D$, Reg]
- Branch comparator moved to Decode stage.
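
The cost of the two stall options above can be put into numbers with a short Python sketch. It assumes an ideal base CPI of 1 and that branches are the only source of stalls; the ~20% branch frequency comes from the slide, while the function name `effective_cpi` is only illustrative.

```python
def effective_cpi(base_cpi, branch_fraction, bubbles_per_branch):
    """Average cycles per instruction when every branch costs a fixed number of bubbles."""
    return base_cpi + branch_fraction * bubbles_per_branch

# Option 1: stall on every branch until the new PC is known (2 bubbles).
cpi_two_bubbles = effective_cpi(1.0, 0.20, 2)

# Optimization #1: comparator moved to Stage 2, so only 1 bubble per branch.
cpi_one_bubble = effective_cpi(1.0, 0.20, 1)

print(f"{cpi_two_bubbles:.2f}")  # 1.40
print(f"{cpi_one_bubble:.2f}")   # 1.20
```

Under these assumptions, moving the comparison into the decode stage cuts the branch overhead from 40% to 20% of the ideal cycle count.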

Control Hazards
- Option 2: Predict outcome of a branch, fix up if guess wrong
  - Must cancel all instructions in pipeline that depended on guess that was wrong
  - Simplest hardware if predict all branches NOT taken

Control Hazard: Branching
- Option #3: Redefine branches
  - Old definition: if we take the branch, none of the instructions after the branch get executed by accident
  - New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)
- The term Delayed Branch means we always execute the instruction after the branch
- This optimization is used with MIPS

Control Hazard: Branching
- Notes on Branch-Delay Slot
  - Worst-Case Scenario: put a no-op in the branch-delay slot
  - Better Case: find an instruction preceding the branch that can be placed in the branch-delay slot without affecting flow of the program
    - Re-ordering instructions is a common way to speed up programs
    - Compiler usually finds such an instruction 50% of the time
    - Jumps also have a delay slot

Example: Nondelayed vs. Delayed Branch

Nondelayed Branch          Delayed Branch
    or  $8, $9, $10            add $1, $2, $3
    add $1, $2, $3             sub $4, $5, $6
    sub $4, $5, $6             beq $1, $4, Exit
    beq $1, $4, Exit           or  $8, $9, $10
    xor $10, $1, $11           xor $10, $1, $11
Exit:                      Exit:

Delayed Branch/Jump and MIPS ISA?
- Why does JAL put PC+8 in register 31?
  - JAL executes the following instruction (PC+4), so it should return to PC+8

Administrivia
- Project 3: Thread Level Parallelism + Data Level Parallelism + Cache Optimization
  - Part 2 due Saturday 11/13
- Project 4: Single Cycle Processor in Logisim
  - Part 2 due Saturday 11/27
  - Face-to-Face grading: Sign up for time slot in last week
- Extra Credit: Fastest Version of Project 3
  - Due Monday 11/29 Midnight
- Final Review: TBD (Vote via Survey!)
  - Please narrow what we need to study on review
- Final: Mon Dec 13 8AM-11AM (TBD)

Computers in the News
- "Kinect Pushes Users Into a Sweaty New Dimension", NY Times, Nov 4, David Pogue (motion tracking)
- "The Kinect's astonishing technology creates a completely new activity that's social, age-spanning and even athletic. Microsoft owes a huge debt to the Nintendo Wii, yes, but it also deserves huge credit for catapulting the motion-tracking concept into a mind-boggling new dimension."

Greater Instruction-Level Parallelism (ILP)
- Deeper pipeline (5 => 10 => 15 stages)
  - Less work per stage => shorter clock cycle
- Multiple issue superscalar
  - Replicate pipeline stages => multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice
(Section 4.10: Parallelism and Advanced Instruction Level Parallelism)

Multiple Issue
- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into issue slots
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

Superscalar Laundry: Parallel per stage
[Figure: laundry timeline starting at 6 PM, task order vs. time; tasks A (light clothing), B (dark clothing), C (very dirty clothing), D (light clothing), E (dark clothing), F (very dirty clothing)]
- More resources, HW to match mix of parallel tasks?

Pipeline Depth and Issue Width
[Table: Intel processors over time, with columns Microprocessor, Year, Clock Rate, Pipeline Stages, Issue Width, Cores, Power; rows i486, Pentium, Pentium Pro, P4 Willamette, P4 Prescott, Core 2 Conroe, Core 2 Yorkfield, Core i7 Gulftown]
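
The peak-rate arithmetic on the Greater ILP slide (4 GHz, 4-way multiple issue) can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the results are upper bounds that ignore the dependencies the slide warns about.

```python
def peak_ipc(issue_width):
    """At best, one instruction completes per issue slot each cycle."""
    return issue_width

def peak_cpi(issue_width):
    """Peak CPI is the reciprocal of peak IPC."""
    return 1 / peak_ipc(issue_width)

def peak_bips(clock_ghz, issue_width):
    """Billions of instructions per second: clock rate in GHz times peak IPC."""
    return clock_ghz * peak_ipc(issue_width)

print(peak_bips(4, 4))  # 16
print(peak_cpi(4))      # 0.25
print(peak_ipc(4))      # 4
```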

Pipeline Depth and Issue Width
[Figure: chart of clock rate, power, pipeline stages, issue width, and cores]

Static Multiple Issue
- Compiler groups instructions into issue packets
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations

Scheduling Static Multiple Issue
- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

Address  Instruction type  Pipeline stages
n        ALU/branch        IF  ID  EX  MEM WB
n + 4    Load/store        IF  ID  EX  MEM WB
n + 8    ALU/branch            IF  ID  EX  MEM WB
n + 12   Load/store            IF  ID  EX  MEM WB
n + 16   ALU/branch                IF  ID  EX  MEM WB
n + 20   Load/store                IF  ID  EX  MEM WB
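
The packet rules above (one ALU/branch slot, one load/store slot, no dependency inside the packet, 64-bit alignment) can be written down as a small Python check. This is a sketch under simplifying assumptions: the `Instr` record and the function names are hypothetical, and only the read-after-write case inside a packet is tested.

```python
from collections import namedtuple

# Hypothetical instruction record: kind is "alu", "branch", "load", "store" or "nop".
Instr = namedtuple("Instr", ["kind", "dest", "srcs"])

NOP = Instr("nop", None, ())

def can_share_packet(first, second):
    """True if `first` fits the ALU/branch slot, `second` fits the load/store
    slot, and `second` does not read a register that `first` writes."""
    if first.kind not in ("alu", "branch", "nop"):
        return False
    if second.kind not in ("load", "store", "nop"):
        return False
    if first.dest is not None and first.dest in second.srcs:
        return False
    return True

def is_packet_aligned(address):
    """A two-instruction packet (2 x 4 bytes) must start on a 64-bit boundary."""
    return address % 8 == 0

add = Instr("alu", "$t0", ("$s0", "$s1"))    # add $t0, $s0, $s1
lw_dep = Instr("load", "$s2", ("$t0",))      # lw  $s2, 0($t0): reads $t0
lw_free = Instr("load", "$s2", ("$s3",))     # lw  $s2, 0($s3): independent

print(can_share_packet(add, lw_dep))   # False
print(can_share_packet(add, lw_free))  # True
print(can_share_packet(add, NOP))      # True
```

When the check fails, the compiler would split the pair across two packets and pad the unused slots with nop.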

Hazards in the Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can't use ALU result in load/store in same packet
      add $t0, $s0, $s1
      lw  $s2, 0($t0)
  - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required

Scheduling Example
- Schedule this for dual-issue MIPS

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

        ALU/branch              Load/store        cycle
Loop:   nop                     lw  $t0, 0($s1)   1
        addi $s1, $s1, -4       nop               2
        addu $t0, $t0, $s2      nop               3
        bne  $s1, $zero, Loop   sw  $t0, 4($s1)   4

- IPC = 5/4 = 1.25 (cf. peak IPC = 2)

Loop Unrolling
- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called register renaming
  - Avoid loop-carried anti-dependencies
    - Store followed by a load of the same register
    - Aka name dependence: reuse of a register name

Loop Unrolling Example

        ALU/branch              Load/store        cycle
Loop:   addi $s1, $s1, -16      lw  $t0, 0($s1)   1
        nop                     lw  $t1, 12($s1)  2
        addu $t0, $t0, $s2      lw  $t2, 8($s1)   3
        addu $t1, $t1, $s2      lw  $t3, 4($s1)   4
        addu $t2, $t2, $s2      sw  $t0, 16($s1)  5
        addu $t3, $t3, $s2      sw  $t1, 12($s1)  6
        nop                     sw  $t2, 8($s1)   7
        bne  $s1, $zero, Loop   sw  $t3, 4($s1)   8

- IPC = 14/8 = 1.75
  - Close to 2, but at cost of registers and code size
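
The two IPC figures quoted for the schedules above can be reproduced by counting non-nop slots per cycle. The Python sketch below encodes each schedule as a list of (ALU/branch slot, load/store slot) pairs using opcode names only; the helper name `ipc` and this encoding are illustrative choices, not part of the lecture.

```python
def ipc(schedule):
    """Useful (non-nop) instructions issued per cycle for a dual-issue schedule."""
    useful = sum(1 for packet in schedule for slot in packet if slot != "nop")
    return useful / len(schedule)

# One packet per cycle: (ALU/branch slot, load/store slot).
rolled = [
    ("nop",  "lw"),
    ("addi", "nop"),
    ("addu", "nop"),
    ("bne",  "sw"),
]

unrolled = [
    ("addi", "lw"),
    ("nop",  "lw"),
    ("addu", "lw"),
    ("addu", "lw"),
    ("addu", "sw"),
    ("addu", "sw"),
    ("nop",  "sw"),
    ("bne",  "sw"),
]

print(ipc(rolled))    # 1.25
print(ipc(unrolled))  # 1.75
```

Unrolling four iterations fills most of the nop slots, which is why the measured IPC moves from 1.25 toward the peak of 2.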

Dynamic Multiple Issue
- Superscalar processors
- CPU decides whether to issue 0, 1, 2, ... instructions each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
- Allow the CPU to execute instructions out of order to avoid stalls
  - But commit result to registers in order
- Example
      lw   $t0, 20($s2)
      addu $t1, $t0, $t2
      subu $s4, $s4, $t3
      slti $t5, $s4, 20
  - Can start subu while addu is waiting for lw

Why Do Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can't always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Speculation
- Guess what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome (Branch Prediction)
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated
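
Why subu can start while addu waits for lw in the dynamic scheduling example becomes visible once the read-after-write dependencies are listed. The Python sketch below does that for the four-instruction sequence; the tuple encoding and the name `raw_dependencies` are illustrative assumptions, and a real out-of-order core tracks considerably more than this.

```python
# Each entry: (assembly text, destination register, source registers).
program = [
    ("lw   $t0, 20($s2)",  "$t0", {"$s2"}),
    ("addu $t1, $t0, $t2", "$t1", {"$t0", "$t2"}),
    ("subu $s4, $s4, $t3", "$s4", {"$s4", "$t3"}),
    ("slti $t5, $s4, 20",  "$t5", {"$s4"}),
]

def raw_dependencies(instructions):
    """Map each instruction index to the earlier instructions whose results it reads."""
    deps = {}
    last_writer = {}  # register name -> index of the most recent writer
    for i, (_, dest, srcs) in enumerate(instructions):
        deps[i] = sorted({last_writer[r] for r in srcs if r in last_writer})
        last_writer[dest] = i
    return deps

deps = raw_dependencies(program)
for i, (text, _, _) in enumerate(program):
    print(text, "waits on", deps[i])
```

In this sequence addu depends on lw and slti depends on subu, while subu has no producer inside the window, so hardware is free to start it during the load delay.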

Pipeline Hazard: Matching socks in later load
[Figure: laundry timeline starting at 6 PM, task order (A to F) vs. time, with a bubble]
- A depends on D; stall since folder tied up

Out-of-Order Laundry: Don't Wait
[Figure: laundry timeline starting at 6 PM, task order (A to F) vs. time, with a bubble]
- A depends on D; rest continue; need more resources to allow out-of-order

Out Of Order Intel
- All use OOO since 2001

Microprocessor    Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486              5                1            No                        1      5W
Pentium           5                2            No                        1      10W
Pentium Pro       10               3            Yes                       1      29W
P4 Willamette     22               3            Yes                       1      75W
P4 Prescott       31               3            Yes                       1      103W
Core              14               4            Yes                       2      75W
Core 2 Yorkfield  16               4            Yes                       4      95W
Core i7 Gulftown  16               4            Yes                       6      130W

Does Multiple Issue Work? The BIG Picture
- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well

And in Conclusion...
- Pipeline challenge is hazards
  - Forwarding helps with many data hazards
  - Delayed branch helps with control hazard in 5-stage pipeline
  - Load delay slot / interlock necessary
- More aggressive performance:
  - Longer pipelines
  - Superscalar
  - Out-of-order execution
  - Speculation
