This Unit: Dynamic Scheduling. Can Hardware Overcome These Limits? Scheduling: Compiler or Hardware. The Problem With In-Order Pipelines



This Unit: Dynamic Scheduling

CSE 560 Computer Systems Architecture: Dynamic Scheduling
Slides originally developed by Drew Hilton (IBM) and Milo Martin (University of Pennsylvania)

[Figure: system layers: App, App, App / System software / Mem, CPU, I/O]

Code scheduling
  To reduce pipeline stalls
  To increase ILP (insn level parallelism)
Two approaches to scheduling
  Last Unit: Static scheduling by the compiler
  This Unit: Dynamic scheduling by the hardware

Scheduling: Compiler or Hardware

Compiler
  + Potentially large scheduling scope (full program)
  + Simple hardware: fast clock, short pipeline, and low power
  - Low branch prediction accuracy (profiling?)
  - Little information on memory dependences (profiling?)
  - Can't dynamically respond to cache misses
  - Pain to speculate and recover from mis-speculation (h/w support?)
Hardware
  + High branch prediction accuracy
  + Dynamic information about memory dependences
  + Can respond to cache misses
  + Easy to speculate and recover from mis-speculation
  - Finite buffering resources fundamentally limit scheduling scope
  - Scheduling machinery adds pipeline stages and consumes power

Can Hardware Overcome These Limits?

Dynamically-scheduled processors
  Also called out-of-order processors
  Hardware re-schedules insns within a sliding window of von Neumann insns
  As with pipelining and superscalar, ISA unchanged
    Same hardware/software interface, appearance of in-order
Increases scheduling scope
  Does loop unrolling transparently
  Uses branch prediction to unroll branches
Examples: Pentium Pro/II/III (3-wide), Core 2 (4-wide), Alpha (4-wide), MIPS R10000 (4-wide), Power5 (5-wide)
Basic overview of approach

The Problem With In-Order Pipelines

  cycle               1   2   3   4   5   6   7   8   9   10  11
  addf f0,f1 -> f2    F   D   E+  E+  E+  W
  mulf f2,f3 -> f2        F   d*  d*  D   E*  E*  E*  E*  E*  W
  subf f0,f1 -> f4            F   p*  p*  D   E+  E+  E+  W

What's happening in cycle 4?
  mulf stalls due to data dependence
    OK, this is a fundamental problem
  subf stalls due to pipeline hazard
    Why? subf can't proceed into D because mulf is there
    That is the only reason, and it isn't a fundamental one
  Maintaining in-order writes to reg. file (both write f2)
Why can't subf go into D in cycle 4 and E+ in cycle 5?

A Word About Data Hazards

Real insn sequences pass values via registers/memory
Three kinds of data dependences (where's the fourth?)

Read-after-write (RAW): true dependence
  Registers:  add r2,r3 -> r1
              sub r1,r4 -> r2
              or  r6,r3 -> r1
  Memory:     st r1 -> [r2]
              ld [r2] -> r4
Write-after-read (WAR): anti-dependence
  Registers:  add r2,r3 -> r1
              sub r5,r4 -> r2
              or  r6,r3 -> r1
  Memory:     ld [r1] -> r2
              st r3 -> [r1]
Write-after-write (WAW): output dependence
  Registers:  add r2,r3 -> r1
              sub r1,r4 -> r2
              or  r6,r3 -> r1
  Memory:     st r1 -> [r2]
              st r3 -> [r2]

Only one dependence between any two insns (RAW has priority)
Focus on RAW dependences
WAR and WAW: less common, just bad naming luck
  Eliminated by using new register names (can't rename memory!)
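
The three dependence kinds can be checked mechanically for a pair of register instructions. The sketch below is illustrative only: the `Insn` tuple and the `classify` helper are hypothetical names, not part of the slides, and the ordering of WAR before WAW is an assumption (the slides only say that RAW has priority).

```python
from typing import NamedTuple, Optional

class Insn(NamedTuple):
    op: str
    srcs: tuple   # registers read
    dst: str      # register written

def classify(older: Insn, younger: Insn) -> Optional[str]:
    """Return the dependence from `older` to `younger`, RAW taking priority."""
    if older.dst in younger.srcs:
        return "RAW"   # younger reads what older wrote
    if younger.dst in older.srcs:
        return "WAR"   # younger overwrites what older read
    if younger.dst == older.dst:
        return "WAW"   # both write the same register
    return None

add = Insn("add", ("r2", "r3"), "r1")
sub_raw = Insn("sub", ("r1", "r4"), "r2")
sub_war = Insn("sub", ("r5", "r4"), "r2")
or_ = Insn("or", ("r6", "r3"), "r1")

print(classify(add, sub_raw))  # RAW
print(classify(add, sub_war))  # WAR
print(classify(add, or_))      # WAW
```

The check is pairwise only; it does not account for an intervening instruction that overwrites the register in between.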

Find the RAW, WAR, and WAW dependences

  add  r1  <- r2, r3
  sub  r4  <- r1, r5
  and  r2  <- r4, r7
  xor  r10 <- r2, r11
  or   r12 <- r10, r13
  mult r1  <- r10, r13

RAW dependencies:
  r1 from add to sub
  r4 from sub to and
  r2 from and to xor
  r10 from xor to or
  r10 from xor to mult
WAR dependencies:
  r2 from add to and
  r1 from sub to mult
WAW dependencies:
  r1 from add to mult

Code Example

Raw insns:
  add r2,r3 -> r1
  sub r2,r1 -> r3
  mul r2,r3 -> r3
  div r1,4 -> r1

True (real) & False (artificial) dependencies
Divide is independent of subtract and multiply insns
  Can execute in parallel with subtract
Many registers re-used
  Just as in static scheduling, the register names get in the way
  How does the hardware get around this?
Approach: (step #1) rename registers, (step #2) schedule

Step #1: Register Renaming

To eliminate register conflicts/hazards
Architected vs. physical registers: level of indirection
  Names: r1, r2, r3
  Locations: p1, p2, p3, p4, p5, p6, p7
  Original mapping: r1 -> p1, r2 -> p2, r3 -> p3; p4-p7 are available

  Raw insns          Renamed insns
  add r2,r3,r1       add p2,p3,p4
  sub r2,r1,r3       sub p2,p4,p5
  mul r2,r3,r3       mul p2,p5,p6
  div r1,4,r1        div p4,4,p7

Renaming: conceptually write each register once
  + Removes false dependences
  + Leaves true dependences intact!
When to reuse a physical register? After overwriting insn done

Step #2: Dynamic Scheduling

[Figure: pipeline with I$, branch predictor (BP), decode (D), insn buffer, scheduler (S), regfile, and D$; the insn buffer holds the renamed add, sub, mul, and div]

Ready Table
  P2   P3   P4   P5   P6   P7    Executes
  Yes  Yes                       add
  Yes  Yes  Yes                  sub and div
  Yes  Yes  Yes  Yes       Yes   mul
  Yes  Yes  Yes  Yes  Yes  Yes

Instructions fetched/decoded/renamed into Instruction Buffer
  AKA instruction window or instruction scheduler
Instructions (conceptually) check ready bits every cycle
  Execute when ready

Out-of-order Pipeline

  In-order front end: Fetch, Decode, Rename, Dispatch
  Buffer of instructions
  Out-of-order execution: Issue, Reg-read, Execute, Writeback
  Commit
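
The "check ready bits every cycle, execute when ready" idea from Step #2 can be sketched as a small simulation. This is a simplified illustration, not the hardware's actual mechanism: `schedule` is a hypothetical helper, the p1-p7 register names are assumed for the renamed add/sub/mul/div example, and every instruction is assumed to take one cycle with unlimited issue width.

```python
def schedule(insns, ready):
    """Issue every instruction whose inputs are ready, cycle by cycle.

    insns: list of (name, input_regs, output_reg) in program order.
    ready: physical registers that already hold values.
    Assumes 1-cycle latency and unlimited issue width (both simplifications).
    """
    ready = set(ready)
    waiting = list(insns)
    cycles = []
    while waiting:
        # All waiting instructions check their ready bits in the same cycle.
        issued = [i for i in waiting if all(r in ready for r in i[1])]
        if not issued:
            raise ValueError("deadlock: an input is never produced")
        cycles.append([name for name, _, _ in issued])
        # Results become visible to dependents in the next cycle.
        ready.update(out for _, _, out in issued)
        waiting = [i for i in waiting if i not in issued]
    return cycles

# Renamed example (register names are assumed for illustration).
renamed = [
    ("add", ["p2", "p3"], "p4"),
    ("sub", ["p2", "p4"], "p5"),
    ("mul", ["p2", "p5"], "p6"),
    ("div", ["p4"], "p7"),   # the immediate 4 needs no ready bit
]
print(schedule(renamed, ready={"p1", "p2", "p3"}))
# [['add'], ['sub', 'div'], ['mul']]
```

Under these assumptions div issues alongside sub, ahead of the older mul, which is the reordering that in-order issue cannot perform.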

REGISTER RENAMING

Register Renaming Algorithm

Data structures:
  maptable[architectural_reg] -> physical_reg
  Free list: get/put free register
Algorithm: at decode, for each instruction:
  insn.phys_input1 = maptable[insn.arch_input1]
  insn.phys_input2 = maptable[insn.arch_input2]
  insn.phys_to_free = maptable[insn.arch_output]
  new_reg = get_free_phys_reg()
  insn.phys_output = new_reg
  maptable[insn.arch_output] = new_reg
At commit:
  Once all older instructions have committed, free register
  put_free_phys_reg(insn.phys_to_free)

[Animation: an xor instruction is renamed step by step against a map table for r1-r5 and a free list]
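
A minimal runnable sketch of the renaming algorithm above. The `Renamer` class, its dictionary-based instruction record, and the p1-p7 register names are hypothetical choices for illustration; the treatment of immediates (any source not in the map table passes through unchanged) is also an assumption.

```python
from collections import deque

class Renamer:
    def __init__(self, maptable, free_regs):
        self.maptable = dict(maptable)      # architectural -> physical
        self.free_list = deque(free_regs)   # available physical registers

    def rename(self, op, srcs, dst):
        # Inputs use the current mapping; immediates pass through unchanged.
        phys_srcs = [self.maptable.get(s, s) for s in srcs]
        # Remember the previous mapping of dst so it can be freed at commit.
        to_free = self.maptable[dst]
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        new_reg = self.free_list.popleft()
        self.maptable[dst] = new_reg
        return {"op": op, "srcs": phys_srcs, "dst": new_reg, "to_free": to_free}

    def commit(self, insn):
        # Called in program order, once all older instructions have committed.
        self.free_list.append(insn["to_free"])

r = Renamer({"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5", "p6", "p7"])
program = [
    ("add", ["r2", "r3"], "r1"),
    ("sub", ["r2", "r1"], "r3"),
    ("mul", ["r2", "r3"], "r3"),
    ("div", ["r1", "4"], "r1"),
]
renamed = [r.rename(*insn) for insn in program]
for insn in renamed:
    print(insn["op"], insn["srcs"], "->", insn["dst"], "(frees", insn["to_free"] + ")")
```

Each destination gets a fresh physical register, so the WAR and WAW conflicts on r1 and r3 disappear while the RAW chains (add to sub to mul, add to div) are preserved through the map table.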

[Animation, continued: add and sub are renamed against the map table and free list]

[Animation, continued: sub and addi are renamed against the map table and free list]

Out-of-order Pipeline

  Fetch, Decode, Rename, Dispatch, [buffer of instructions], Issue, Reg-read, Execute, Writeback, Commit
  Have unique register names
  Now put into ooo execution structures

Dispatch

Renamed instructions into ooo structures
  Re-order buffer (ROB)
    Holds all instructions until they commit
  Issue queue (IQ)
    Un-executed instructions
    Central piece of scheduling logic
    Content Addressable Memory (CAM) (more later)

DYNAMIC SCHEDULING

Issue Queue

Holds un-executed instructions
Tracks ready inputs
  Physical register names + ready bit
  AND to tell if ready

Issue queue entry fields: Insn, Input 1, R (ready), Input 2, R (ready), Dst, Age

Dispatch Steps

Allocate IQ slot
  Full? Stall
Read ready bits of inputs
  Table: 1 bit per preg
Clear ready bit of output in table
  Instruction has not produced value yet
Write data in IQ slot

[Animation: xor (age 0), add (age 1), and sub are dispatched into the issue queue one at a time, ahead of addi, with the ready-bit table updated at each step]
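
The four dispatch steps can be sketched as follows. This is an illustration under stated assumptions: the `IssueQueue` class is hypothetical, the physical register operands (p1-p9) are invented because the slide's operands are not available, and `None` is used to mark an immediate operand that is always ready.

```python
class IssueQueue:
    """Toy issue queue; entry fields follow the slide's layout."""

    def __init__(self, size, ready_table):
        self.size = size
        self.ready_table = ready_table   # one ready bit per physical register
        self.slots = []
        self.next_age = 0

    def _ready(self, reg):
        # Assumption: None marks an immediate, which is always ready.
        return reg is None or self.ready_table[reg]

    def dispatch(self, op, in1, in2, dst):
        """Return True on success, False if the queue is full (stall)."""
        if len(self.slots) >= self.size:             # 1. allocate IQ slot
            return False
        r1, r2 = self._ready(in1), self._ready(in2)  # 2. read input ready bits
        self.ready_table[dst] = False                # 3. clear output ready bit
        self.slots.append({"op": op, "in1": in1, "r1": r1,
                           "in2": in2, "r2": r2,
                           "dst": dst, "age": self.next_age})  # 4. write slot
        self.next_age += 1
        return True

ready_table = {f"p{i}": True for i in range(1, 10)}
iq = IssueQueue(size=4, ready_table=ready_table)
iq.dispatch("xor", "p1", "p2", "p6")
iq.dispatch("add", "p6", "p4", "p7")
iq.dispatch("sub", "p5", "p2", "p8")
iq.dispatch("addi", "p8", None, "p9")
for slot in iq.slots:
    print(slot["age"], slot["op"], "ready" if slot["r1"] and slot["r2"] else "waiting")
```

Clearing the destination's ready bit at dispatch is what makes later consumers (here add and addi) enter the queue with a not-ready input.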

[Animation, continued: addi is dispatched; the issue queue now holds xor (age 0), add (age 1), sub (age 2), and addi]

Out-of-order pipeline

  Issue, Reg-read, Execute, Writeback
Execution (ooo) stages
  Select ready instructions
    Send for execution
  Wakeup dependents

Dynamic Scheduling/Issue Algorithm

Data structures:
  Ready table[phys_reg] -> yes/no (part of issue queue)
Algorithm at schedule stage (prior to read registers):
  foreach instruction:
    if table[insn.phys_input1] == ready && table[insn.phys_input2] == ready
      then insn is ready
  select the oldest ready instruction
    table[insn.phys_output] = ready

Issue = Select + Wakeup

Select N oldest, ready instructions
  N=1: xor
  N=2: xor and sub
  Note: may have execution resource constraints, i.e., load/store/fp
Wakeup dependent instructions
  CAM search for Dst in inputs
  Set ready
  Also update ready-bit table for future instructions
Select/Wakeup in one cycle
  Dependents go back to back
  Next cycle: add/addi are ready

Issue

[Animation: xor and sub are issued first; add (age 1) and addi follow in the next cycle]
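
A minimal sketch of one issue cycle as select followed by wakeup. The `issue` and `make_queue` helpers are hypothetical, and the physical register operands are invented so that xor and sub start out ready while add and addi wait on them, matching the N=1 and N=2 cases above. The linear scan over the queue stands in for the CAM search; execution resource constraints are ignored.

```python
def issue(queue, ready_table, width):
    """One issue cycle: select up to `width` oldest ready entries, then wake dependents."""
    # Select: an entry is ready when both input ready bits are set.
    ready_entries = [e for e in queue if e["r1"] and e["r2"]]
    selected = sorted(ready_entries, key=lambda e: e["age"])[:width]
    for e in selected:
        queue.remove(e)
    # Wakeup: match each issued Dst against the inputs of waiting entries.
    for e in selected:
        ready_table[e["dst"]] = True   # for instructions dispatched later
        for waiting in queue:
            if waiting["in1"] == e["dst"]:
                waiting["r1"] = True
            if waiting["in2"] == e["dst"]:
                waiting["r2"] = True
    return [e["op"] for e in selected]

def make_queue():
    # Operands are assumptions chosen for illustration; None is an immediate.
    return [
        {"op": "xor", "in1": "p1", "r1": True, "in2": "p2", "r2": True, "dst": "p6", "age": 0},
        {"op": "add", "in1": "p6", "r1": False, "in2": "p4", "r2": True, "dst": "p7", "age": 1},
        {"op": "sub", "in1": "p5", "r1": True, "in2": "p2", "r2": True, "dst": "p8", "age": 2},
        {"op": "addi", "in1": "p8", "r1": False, "in2": None, "r2": True, "dst": "p9", "age": 3},
    ]

queue = make_queue()
table = {"p6": False, "p7": False, "p8": False, "p9": False}
print(issue(queue, table, width=2))  # ['xor', 'sub']
print(issue(queue, table, width=2))  # ['add', 'addi']
```

Because wakeup happens in the same call as select, the dependents are eligible in the very next cycle, which is the back-to-back behavior the slide describes.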

Register Read

When do instructions read the register file?
Option #1: after select, right before execute
  (Not done at decode)
  Read physical register (renamed)
  Or get value via bypassing (based on physical register name)
  This is Pentium 4, MIPS R10k, Alpha style
  Physical register file may be large
    Multi-cycle read
Option #2: as part of issue, keep values in the issue queue
  Pentium Pro, Core 2, Core i7
