You Are Here! Review: Hazards. Agenda. Agenda. Review: Load / Branch Delay Slots 7/28/2011

Size: px

Start display at page:

Download "You Are Here! Review: Hazards. Agenda. Agenda. Review: Load / Branch Delay Slots 7/28/2011"

Shon Hines
6 years ago
Views:

1 CS 61C: Geat Ideas in Compute Achitectue (Machine Stuctues) Instuction Level Paallelism: Multiple Instuction Issue Guest Lectue: Justin Hsia Softwae Paallel Requests Assigned to compute e.g., Seach Katz Paallel Theads Assigned to coe e.g., Lookup, Ads Paallel Instuctions >1 one time e.g., 5 pipelined instuctions Paallel Data >1 data one time e.g., Add of 4 pais of wods Hadwae desciptions All gates functioning in paallel at same time You Ae Hee! Haness Paallelism & Achieve High Pefomance Hadwae Today s Lectue Waehouse Scale Compute Coe Memoy Cache Compute Coe (Cache) Input/Output Coe Instuction Unit(s) Functional Unit(s) A 0 +B 0 A 1 +B 1 A 2 +B 2 A 3 +B 3 Smat Phone Logic Gates 2 Review: Hazads Situations that pevent stating the next instuction in the next clock cycle 1. Stuctual hazads A equied esouce is busy (e.g., needed in multiple stages) 2. Data hazad Data dependency between instuctions. Need to wait fo pevious instuction to complete its data ead/wite 3. Contol hazad Flow of execution depends on pevious instuction 3 4 Review: Load / Banch Delay Slots Stall is equivalent to nop lw $t0, 0($t1) nop sub $t3,$t0,$t2 and $t5,$t0,$t4 ALU I$ Reg D$ Reg bub bub bub bub bub ble ble ble ble ble o $t7,$t0,$t6 I$ ALU I$ Reg D$ Reg ALU I$ Reg D$ Reg ALU Reg D$ 5 6 1

2 Geate Instuction-Level Paallelism (ILP) Deepe pipeline (5 => 10 => 15 stages) Less wok pe stage shote clock cycle Multiple issue is supescala Replicate pipeline stages multiple pipelines Stat multiple instuctions pe clock cycle CPI < 1, so use Instuctions Pe Cycle (IPC) E.g., 4 GHz 4-way multiple-issue 16 BIPS, peak CPI = 0.25, peak IPC = 4 But dependencies educe this in pactice 4.10 Paallelism and Advanced Inst uction Level Paallelism Multiple Issue Static multiple issue Compile goups instuctions to be issued togethe Packages them into issue slots Compile detects and avoids hazads Dynamic multiple issue CPU examines instuction steam and chooses instuctions to issue each cycle Compile can help by eodeing instuctions CPU esolves hazads using advanced techniques at untime 7 8 Supescala Laundy: Paallel pe stage 6 PM AM Time T a A (light clothing) s (dak clothing) k B C (vey dity clothing) O D (light clothing) d E (dak clothing) e F (vey dity clothing) Moe esouces, HW to match mix of paallel tasks? Pipeline Depth and Issue Width Intel Pocessos ove Time Micopocesso Yea Clock Rate Pipeline Stages Issue width Coes Powe i MHz W Pentium MHz W Pentium Po MHz W P4 Willamette MHz W P4 Pescott MHz W Coe 2 Conoe MHz W Coe 2 Yokfield MHz W Coe i7 Gulftown MHz W 9 10 Pipeline Depth and Issue Width Static Multiple Issue Clock Powe Pipeline Stages Issue width Coes Compile goups instuctions into issue packets Goup of instuctions that can be issued on a single cycle Detemined by pipeline esouces equied Think of an issue packet as a vey long instuction (VLIW) Specifies multiple concuent opeations

3 Scheduling Static Multiple Issue Compile must emove some/all hazads Reode instuctions into issue packets No dependencies with a packet Possibly some dependencies between packets Vaies between ISAs; compile must know! Pad with nop if necessay MIPS with Static Dual Issue Dual-issue packets One ALU/banch instuction One load/stoe instuction 64-bit aligned ALU/banch, then load/stoe Pad an unused instuction with nop Addess Instuction type Pipeline Stages n ALU/banch IF ID EX MEM WB n + 4 Load/stoe IF ID EX MEM WB n + 8 ALU/banch IF ID EX MEM WB n + 12 Load/stoe IF ID EX MEM WB n + 16 ALU/banch IF ID EX MEM WB n + 20 Load/stoe IF ID EX MEM WB Hazads in the Dual-Issue MIPS Moe instuctions executing in paallel EX data hazad Fowading avoided stalls with single-issue Now can t use ALU esult in load/stoe in same packet add $t0, $s0, $s1 load $s2, 0($t0) Split into two packets, effectively a stall Load-use hazad Still one cycle use latency, but now two instuctions Moe aggessive scheduling equied Scheduling Example Schedule this fo dual-issue MIPS Loop: lw $t0, 0($s1) # $t0=aay element addu $t0, $t0, $s2 # add scala in $s2 sw $t0, 0($s1) # stoe esult addi $s1, $s1, 4 # decement pointe bne $s1, $zeo, Loop # banch $s1!=0 ALU/banch Load/stoe cycle Loop: nop lw $t0, 0($s1) 1 addi $s1, $s1, 4 nop 2 addu $t0, $t0, $s2 nop 3 bne $s1, $zeo, Loop sw $t0, 4($s1) 4 IPC = 5/4 = 1.25 (c.f. peak IPC = 2) Loop Unolling Replicate loop body to expose moe paallelism Use diffeent egistes pe eplication Called egiste it enaming Avoid loop-caied anti-dependencies Stoe followed by a load of the same egiste Aka name dependence Reuse of a egiste name Loop Unolling Example ALU/banch Load/stoe cycle Loop: addi $s1, $s1, 16 lw $t0, 0($s1) 1 nop lw $t1, 12($s1) 2 addu $t0, $t0, $s2 lw $t2, 8($s1) 3 addu $t1, $t1, $s2 lw $t3, 4($s1) 4 addu $t2, $t2, $s2 sw $t0, 16($s1) 5 addu $t3, $t4, $s2 sw $t1, 12($s1) 6 nop sw $t2, 8($s1) 7 bne $s1, $zeo, Loop sw $t3, 4($s1) 8 IPC = 14/8 = 1.75 Close to 2, but at cost of egistes and size

4 Administivia Poject 2 Pat 2 due Sunday. Slides at end of July 12 lectue contain useful info. Lab 12 cancelled! Replaced with fee study session whee you can catch up on labs / wok on poject 2. The TAs will still be thee. Poject 3 will be posted late Sunday (7/31) Two-stage pipelined CPU in Logisim Dynamic Multiple Issue Supescala pocessos CPU decides whethe to issue 0, 1, 2, instuctions each cycle Avoiding stuctual t and ddata hazads Avoids need fo compile scheduling Though it may still help Code semantics ensued by the CPU Dynamic Pipeline Scheduling Allow the CPU to execute instuctions out of ode to avoid stalls But commit esult to egistes in ode Example lw $t0, 20($s2) addu $t1, $t0, $t2 subu $s4, $s4, $t3 slti $t5, $s4, 20 Can stat subu while addu is waiting fo lw Why Do Dynamic Scheduling? Why not just let the compile schedule? Not all stalls ae pedicable e.g., cache misses Can t always schedule aound banches Banch outcome is dynamically detemined Diffeent implementations of an ISA have diffeent latencies and hazads 23 7/28/2011 Summe Lectue #

5 Speculation Guess what to do with an instuction Stat opeation as soon as possible Check whethe guess was ight If so, complete the opeation If not, oll-back and do the ight thing Common to both static and dynamic multiple issue Examples Speculate on banch outcome (Banch Pediction) Roll back if path taken is diffeent Speculate on load Roll back if location is updated 25 T a s k O d e Pipeline Hazad: Matching socks in late load 6 PM AM A bubble B C D E F A depends on D; stall since folde tied up; Time 26 Out-of-Ode Laundy: Don t Wait T a s k 6 PM AM A B C D bubble Time O d E e F A depends on D; est continue; need moe esouces to allow out-of-ode 27 Out-of-Ode Execution (1/3) Basically, unoll loops in hadwae 1. Fetch instuctions in pogam ode ( 4/clock) 2. Pedict banches as taken/untaken 3. To avoid hazads on egistes, ename egistes using a set of intenal egistes (~80 egistes) 28 Out-of-Ode Execution (2/3) Basically, unoll loops in hadwae 4. Collection of enamed instuctions might execute in a window (~60 instuctions) 5. Execute instuctions with eady opeands in 1 of multiple functional units (ALUs, FPUs, Ld/St) 6. Buffe esults of executed instuctions until pedicted banches ae esolved in eode buffe 29 Out-of-Ode Execution (3/3) Basically, unoll loops in hadwae 7. If pedicted banch coectly, commit esults in pogam ode 8. If pedicted banch incoectly, discad all dependent esults and stat with coect PC 30 5

Dynamically Scheduled CPU Banch pediction, Registe enaming Peseves

opeands available Micopocesso Yea Clock Rate Pipeline Stages Issue width

66MHz 5 2 No 1 10W Pentium Po 1997 200MHz 10 3 Yes 1 29W P4 Willamette 2001

wites Can supply opeands fo issued instuctions Results also sent to any

2930MHz 14 4 Yes 2 75W Coe 2 Yokfield 2008 2930 MHz 16 4 Yes 4 95W Coe i7

opeations Queues: - 106 RISC ops - 24 intege ops - 36 FP/SSE ops - 44 ld/st

Flow Fo intege opeations 12 stages (Floating Point is 17 stages) Up to 106

6 Dynamically Scheduled CPU Banch pediction, Registe enaming Peseves dependencies Out-Of-Ode Intel All use O-O-O since 2001 Wait hee until all opeands available Micopocesso Yea Clock Rate Pipeline Stages Issue width Out-of-ode/ Speculation Coes Powe i MHz 5 1 No 1 5W Pentium MHz 5 2 No 1 10W Pentium Po MHz 10 3 Yes 1 29W P4 Willamette MHz 22 3 Yes 1 75W Execute and Hold Reode buffe fo egiste and memoy wites Can supply opeands fo issued instuctions Results also sent to any waiting esevation stations P4 Pescott MHz 31 3 Yes 1 103W Coe MHz 14 4 Yes 2 75W Coe 2 Yokfield MHz 16 4 Yes 4 95W Coe i7 Gulftown MHz 16 4 Yes 6 130W x86 instuctions RISC opeations Queues: RISC ops - 24 intege ops - 36 FP/SSE ops - 44 ld/st AMD Opteon X4 Micoachitectue enaming 16 achitectual egistes 72 physical egistes 4.11 Real Stuff: The AMD Opteon X4 (Bacelona) Pipeline AMD Opteon X4 Pipeline Flow Fo intege opeations 12 stages (Floating Point is 17 stages) Up to 106 RISC-ops in pogess Intel Nehalem is 16 stages fo intege opeations, details not evealed, but likely simila to above. Intel calls RISC opeations Mico opeations o μops

7 Does Multiple Issue Wok? The BIG Pictue Yes, but not as much as we d like Pogams have eal dependencies that limit ILP Some dependencies ae had to eliminate eg e.g., pointe aliasing Some paallelism is had to expose Limited window size duing instuction issue Memoy delays and limited bandwidth Had to keep pipelines full Speculation can help if done well New-School Machine Stuctues (It s a bit moe complicated!) Softwae Paallel Requests Assigned to compute e.g., Seach Katz Paallel Theads Assigned to coe e.g., Lookup, Ads Paallel Instuctions >1 one time e.g., 5 pipelined instuctions Paallel Data >1 data one time e.g., Add of 4 pais of wods Hadwae desciptions All gates functioning in paallel at same time Haness Paallelism & Achieve High Pefomance Hadwae Waehouse Scale Compute Poject 1 Coe Memoy Lab 14 Input/Output Coe Instuction Unit(s) Functional Unit(s) A 0 +B 0 A 1 +B 1 A 2 +B 2 A 3 +B 3 Smat Phone Compute Coe (Cache) Poject 2 Main Memoy Logic Gates Poject 39 3 Big Pictue on Paallelism Two types of paallelism in applications 1. Data-Level Paallelism (DLP): aises because thee ae many data items that can be opeated on at the same time 2. Task-Level Paallelism: aises because tasks of wok ae ceated that can opeate lagely in paallel 40 Big Pictue on Paallelism Hadwae can exploit app DLP and Task LP in fou ways: 1. Instuction-Level Paallelism: Hadwae exploits application DLP using ideas like pipelining and speculative execution 2. SIMD achitectues:exploit p app DLP by applying a single instuction to a collection of data in paallel 3. Thead-Level Paallelism: exploits eithe app DLP o TLP in a tightly-coupled hadwae model that allows fo inteaction among paallel theads 4. Request-Level Paallelism: exploits paallelism among lagely decoupled tasks and is specified by the pogamme of the opeating system 41 Pee Instuction Inst LP, SIMD, Thead LP, Request LP ae examples of Paallelism above the Instuction Set Achitectue Paallelism explicitly at the level of the ISA Paallelism below the level of the ISA Inst. LP SIMD Th. LP Req. LP Red = = = = = Geen = = = Puple = Blue 42 7

8 Pee Answe Inst LP, SIMD, Thead LP, Request LP ae examples of Paallelism above the Instuction Set Achitectue Paallelism explicitly at the level of the ISA Paallelism below the level of the ISA Inst. LP SIMD Th. LP Req. LP Red = = = = = Geen = = = Puple = Blue 43 Pee Question State if the following techniques ae associated pimaily with a softwae- o hadwae-based appoach to exploiting ILP (in some cases, the answe may be both): Supescala, Out-of- Ode execution, Speculation, Registe Renaming Out of Ode Supescala Speculation Registe Renaming Red HW HW HW HW SW SW SW SW Geen Both Both Both Both HW HW Both Both Puple HW HW HW Both Blue HW HW HW SW 44 Pee Answe State if the following techniques ae associated pimaily with a softwae- o hadwae-based appoach to exploiting ILP (in some cases, the answe may be both): Supescala, Out-of- Ode execution, Speculation, Registe Renaming Out of Ode Supescala Speculation Registe Renaming Red HW HW HW HW Oange SW SW SW SW Geen Both Both Both Both HW HW Both Both Pink HW HW HW Both Blue HW HW HW SW 45 And in Conclusion, Big Ideas of Instuction Level Paallelism Pipelining, Hazads, and Stalls Fowading, Speculation to ovecome Hazads Multiple issue to incease pefomance IPC instead of CPI Dynamic Execution: Supescala in-ode issue, banch pediction, egiste enaming, out-ofode execution, in-ode commit unoll loops in HW, hide cache misses 46 But wait thee s moe??? If we have time Review: C Memoy Management C has thee pools of data memoy (+ memoy) Static stoage: global vaiable stoage, basically pemanent, entie pogam un The Stack: local vaiable stoage, paametes, etun addess The Heap (dynamic stoage): malloc() gabs space fom hee, fee() etuns it Common (Dynamic) Memoy Poblems Using uninitialized values Accessing memoy beyond you allocated egion Impope use of fee/ealloc by messing with the pointe handle etuned by malloc Memoy leaks: mismatched malloc/fee pais pevents accesses between and 48 (via vitual memoy) 8

9 Simplest Model Only one pogam unning on the compute Addesses in the pogam ae exactly the physical memoy addesses Extensions to the simple model: What if less physical memoy than full addess space? What if we want to un multiple pogams at the same time? Poblem #1: Physical Memoy Less Than the Full Addess One achitectue, many implementations, with possibly diffeent amounts of memoy Memoy used to vey expensive and physically bulky Whee does the gow fom then? Logical Vitual Real Idea: Level of Indiection to Ceate Illusion of Lage Physical Memoy Hi Ode Bits of Vitual Addess Addess Map o Table Hi Ode Bits of Physical Addess Logical Vitual Physical Real Vitual Page Page Addess Addess 51 Poblem #2: Multiple Pogams Shaing the Machine s Addess How can we un multiple pogams without accidentally stepping on same addesses? How can we potect pogams fom clobbeing each othe? Application 1 52 Application 2 Idea: Level of Indiection to Ceate Illusion of Sepaate Addess s One table pe unning application OR Real swap table contents when switching 53 Extension to the Simple Model Multiple pogams shaing the same addess space E.g., Opeating system uses low end of addess ange shaed with application Multiple pogams in shaed (vitual) addess space Static ti management: fixed patitioning/allocation i of space Dynamic management: pogams come and go, take diffeent amount of time to execute, use diffeent amounts of memoy How can we potect pogams fom clobbeing each othe? How can we allocate memoy to applications on demand? 54 9

10 (4 GB - 64 MB) ~ hex ~ 0FFF FFFF hex 2 28 bytes (64 MB) Static Division of Shaed Addess Application Opeating System E.g., how to manage the caving up of the addess space among and applications? Whee does the end and the application begin? Dynamic management, with potection, would be bette! 55 Fist Idea: Base + Bounds Registes fo Location Independence Max add Location-independent pogams Pogamming and stoage management ease: need fo a base egiste Potection Independent pogams should not affect each othe inadvetently: need fo a bound egiste Histoically, base + bounds egistes wee a vey ealy idea in compute achitectue 0 add pog1 pog2 56 Physical Memoy Simple Base and Bound Tanslation lw X Bound Registe Effective Addess Base Registe Segment Length Bounds Violation? Physical Addess cuent segment Base Physical Addess Pogam Addess Base and bounds egistes ae visible/accessible to pogamme Tap to if bounds violation detected ( seg fault / coe dumped ) 57 + Physica al Memoy pgm 2 Pogams Shaing Memoy pgms 4 & 5 aive pgm 2 pgm 4 pgm 5 8K pgms 2 & 5 leave pgm 4 Why do we want to un multiple pogams? Run othes while waiting fo I/O What pevents pogams fom accessing each othe s data? fee 8K 58 Student Roulette? Restiction on Base + Bounds Regs Want only the Opeating System to be able to change Base and Bound Registes Pocessos need diffeent execution modes 1. Use mode: can use Base and Bound Registes, but cannot change them 2. Supeviso mode: can use and change Base and Bound Registes Also need Mode Bit (0=Use, 1=Supeviso) to detemine pocesso mode Also need way fo pogam in Use Mode to invoke opeating system in Supeviso Mode, and vice vesa 59 pgm 2 Pogams Shaing Memoy pgms 4 & 5 aive pgm 2 pgm 4 pgm 5 8K pgms 2 & 5 leave pgm 4 fee 8K As pogams come and go, the stoage is fagmented. Theefoe, at some stage pogams have to be moved aound to compact the stoage. Easy way to do this? 60 Student Roulette? 10

CS 61C: Great Ideas in Computer Architecture Instruc(on Level Parallelism: Mul(ple Instruc(on Issue

CS 61C: Great Ideas in Computer Architecture Instruc(on Level Parallelism: Mul(ple Instruc(on Issue CS 61C: Geat Ideas in Compute Achitectue Instuc(on Level Paallelism: Mul(ple Instuc(on Issue Instuctos: Kste Asanovic, Randy H. Katz hbp://inst.eecs.bekeley.edu/~cs61c/fa12 1 Paallel Requests Assigned