Chapter 3: Instruction Level Parallelism and Its Exploitation
1 Chapter 3: Instruction Level Parallelism and Its Exploitation - Abdullah Muzahid
2 Hardware-Based Speculation (Section 3.6)
- In multiple-issue processors, stalls due to branches would be frequent: you may need to execute 1 branch per cycle.
- Solution: speculate on the outcome of the branch and execute as if the guess were correct. This is called hardware speculation, and it requires the ability to recover from mispredictions.
- Hardware-based speculation combines:
  - dynamic branch prediction,
  - speculation, to allow instructions to execute (EX) before branches resolve, and
  - aggressive dynamic scheduling across basic blocks (because now we aggressively execute past basic blocks).
3 Speculative Execution Based on Tomasulo
- Used in most processors: Intel Core i3/i5/i7, AMD Phenom, IBM Power7.
- Key: separate the bypassing of results among instructions from full instruction completion:
  - allow a speculative instruction to pass its result to others,
  - but do not complete the instruction (do not allow it to update any state that cannot be undone).
- When the instruction is no longer speculative, allow it to modify the register file or memory -> the Instruction Commit step.
- Overall: instructions execute out of order (OOO), but they commit in order. Prevent any irrevocable action until an instruction commits.
4 Reorder Buffer (ROB)
- A hardware buffer that holds the results of instructions that have completed but not committed.
- Also used to pass results among instructions; it provides additional registers, like the reservation stations.
- Difference between the ROB and Tomasulo's reservation stations:
  - Tomasulo: once the FU writes its result, subsequent instructions find the value in the register set.
  - ROB: the ROB supplies the results of completed but not-yet-committed instructions.
5 ROB Fields in an Entry
- Instruction type: BR (no destination), SD (destination is memory), Other (destination is a register).
- Destination field: destination register or destination memory address.
- Value: the result value, held until commit.
- Ready: is the instruction finished (and therefore the result ready)?
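As an illustration only (not part of the lecture), these four fields translate into a minimal Python sketch; the class names and the FIFO wrapper are hypothetical:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    itype: str                    # 'BR' (no dest), 'SD' (dest is memory), 'OTHER' (dest is reg)
    dest: Optional[str] = None    # destination register name or memory address
    value: Optional[int] = None   # result value, held here until commit
    ready: bool = False           # has the instruction finished (result present)?

class ROB:
    """FIFO of ROBEntry: new instructions append at the tail, commits pop the head."""
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()

    def full(self) -> bool:
        return len(self.entries) == self.size
```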
7 Functions of RS and ROB
- Reservation stations: a place to buffer operations and operands from the time an instruction issues until it can execute. Each RS tracks the ROB entry assigned to its instruction.
- ROB: does the renaming. Every instruction has a position in the ROB from when it issues until it commits, and we tag a result using the ROB entry number rather than the RS number.
8 Overall Instruction Execution
- ISSUE (or DISPATCH; see the code sketch below):
  - Get an instruction from the instruction queue; issue it if there is an empty RS and an empty ROB entry.
  - Send the operands to the RS if they are available in any register or ROB entry.
  - Send the number of the ROB entry to the RS, so that it can be used to tag the result of the execution when it is dumped on the CDB.
  - If there is no free RS or no free ROB entry: stall the instruction and its successors.
- EXECUTE:
  - Monitor the CDB for any missing input to appear (this enforces RAW); when it does, read it into the RS.
  - When both operands are ready and the FU is free, execute. Execution may take several cycles.
  - Loads require 2 steps (address generation and memory access); stores need only 1 step (address generation).
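Continuing the hypothetical Python model from above, a hedged sketch of the issue step; the dict-based reservation stations and the register-status table `reg_status` are illustrative simplifications, not the lecture's exact structures:

```python
def issue(instr, rob, stations, reg_status, regs):
    """ISSUE: needs one free reservation station and one free ROB entry, else stall."""
    rs = next((s for s in stations if not s['busy']), None)
    if rs is None or rob.full():
        return False                        # stall this instruction and its successors

    entry = ROBEntry(itype=instr['itype'], dest=instr.get('dest'))
    rob.entries.append(entry)
    rs.update(busy=True, op=instr['op'], rob_entry=entry)  # the entry object serves as the tag

    for i, src in enumerate(instr.get('srcs', [])):
        producer = reg_status.get(src)      # ROB entry that will produce src, if any
        if producer is None:
            rs[f'V{i}'], rs[f'Q{i}'] = regs[src], None       # operand is in a register
        elif producer.ready:
            rs[f'V{i}'], rs[f'Q{i}'] = producer.value, None  # operand is in the ROB
        else:
            rs[f'V{i}'], rs[f'Q{i}'] = None, producer        # wait for this tag on the CDB

    if entry.itype == 'OTHER' and entry.dest is not None:
        reg_status[entry.dest] = entry      # later readers of dest will see this ROB tag
    return True
```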
9 Overall Instruction Execution (Contd.)
- WRITE RESULT:
  - When execution completes, dump the result on the CDB together with the ROB entry number, and mark the RS as available. The data goes to the other RSs waiting for it and to the correct ROB entry.
  - For stores, if the data value is already available, it is written to the ROB entry. Otherwise, the store waits in the ROB, monitoring the CDB; as soon as the data value appears on the CDB, it is written into the store's ROB entry.
- COMMIT (or GRADUATION; see the sketch below):
  - When an instruction reaches the head of the ROB and its result is present in the entry -> update the register file and remove the ROB entry.
  - If the instruction is a store: do the same, except that memory is updated instead of the register file.
  - If the instruction is a branch with an incorrect prediction: flush the ROB and restart execution at the correct successor.
  - If the instruction is a branch with a correct prediction: commit as usual.
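In the same toy model, a minimal commit sketch; the `mispredicted` flag is a hypothetical extra field that a write-result step would set on branch entries, and fetch redirection is left unmodeled:

```python
def commit(rob, regs, memory):
    """COMMIT: retire the instruction at the head of the ROB, in program order."""
    if not rob.entries or not rob.entries[0].ready:
        return                                   # head not finished: nothing commits yet
    head = rob.entries.popleft()
    if head.itype == 'SD':
        memory[head.dest] = head.value           # stores update memory only at commit
    elif head.itype == 'BR':
        if getattr(head, 'mispredicted', False): # hypothetical field set at write-result
            rob.entries.clear()                  # flush everything younger than the branch
            # ...then restart fetch at the correct branch successor (not modeled here)
    else:
        regs[head.dest] = head.value             # update the architectural register file
```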
10 Workout Table

Instruction        | Issue | Exec B-E | Mem Rd | CDB Write | Commit/Mem Wr
LD   F6, 32(R2)    |       |          |        |           |
LD   F2, 44(R3)    |       |          |        |           |
MULD F0, F2, F4    |       |          |        |           |
SUBD F8, F2, F6    |       |          |        |           |
DIVD F10, F0, F6   |       |          |        |           |
ADDD F6, F8, F2    |       |          |        |           |
11 Example of Branch Misprediction
- Example on page 189 and Figure 3.13.
- After a branch is mispredicted, recover by clearing the ROB of all instructions that follow the branch, letting the branch and the previous instructions commit, and fetching at the correct branch successor.
- Exceptions:
  - Record the exception in the ROB entry.
  - If the excepting instruction is on the mispredicted path, simply squash the instruction and flush the exception.
  - When the excepting instruction reaches the head of the ROB: process it and flush all pending instructions.
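This exception rule also fits the toy model above; a hedged fragment in which the `exception` field is a hypothetical addition recorded in the ROB entry at execute time (a faulting instruction is assumed to be marked ready with its exception recorded):

```python
def commit_with_exceptions(rob, regs, memory):
    """Process an exception only when its instruction reaches the head of the ROB."""
    if not rob.entries or not rob.entries[0].ready:
        return
    exc = getattr(rob.entries[0], 'exception', None)   # recorded when the fault occurred
    if exc is not None:
        # Entries on a mispredicted path were already flushed when the branch committed,
        # so an exception that reaches the head is on the true path: it is precise.
        rob.entries.clear()                            # flush all pending instructions
        raise RuntimeError(f"precise exception: {exc}")  # stand-in for the trap handler
    else:
        commit(rob, regs, memory)                      # otherwise retire the head as usual
```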
12 Workout Table

Instruction          | Issue | Exec B-E | Mem Rd | CDB Write | Commit/Mem Wr
LD    F0, 0(R1)      |       |          |        |           |
MULD  F4, F0, F2     |       |          |        |           |
SD    F4, 0(R1)      |       |          |        |           |
DADDI R1, R1, #-8    |       |          |        |           |
BNE   R1, R2, Loop   |       |          |        |           |
LD    F0, 0(R1)      |       |          |        |           |
MULD  F4, F0, F2     |       |          |        |           |
SD    F4, 0(R1)      |       |          |        |           |
DADDI R1, R1, #-8    |       |          |        |           |
BNE   R1, R2, Loop   |       |          |        |           |
13 Handling Stores
- Tomasulo's algorithm: stores update caches/memory as soon as they complete the WB stage.
- Speculative processor: stores update caches/memory only when they reach the head of the ROB.
- Therefore: memory is never updated by a speculative instruction.
14 Dependences Through Memory
- WAW: eliminated, because the updating of memory occurs in order, when the store is at the head of the ROB; therefore, no earlier load or store can still be pending.
- WAR: eliminated for the same reason as WAW.
- RAW: maintained with two restrictions:
  - a load cannot be sent to memory if there is an earlier store in the ROB with the same address;
  - a load computes its effective address in order relative to all earlier stores.
- Note: some speculative machines actually allow bypassing a value from an earlier store to a later load without the load having to go to memory (sketched below).
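A hedged sketch of this load check in the Python model above; thanks to the in-order effective-address rule, earlier store addresses are assumed already computed, and the optional store-to-load bypass from the note is included:

```python
def resolve_load(rob_entries, load_index, load_addr):
    """Decide whether a load may access memory, must wait, or can take a bypass.
    rob_entries: list view of the ROB, oldest first; load_index: the load's position."""
    for entry in reversed(rob_entries[:load_index]):   # earlier instructions, youngest first
        if entry.itype != 'SD' or entry.dest != load_addr:
            continue
        if entry.value is not None:
            return ('forward', entry.value)   # bypass the store's value out of the ROB
        return ('wait', None)                 # same-address store, data not ready: stall
    return ('memory', None)                   # no earlier same-address store: go to memory
```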
15 Multiple Issue Processors
- If all of the techniques described are successful, the best achievable CPI = 1.
- To reduce CPI below 1, issue multiple instructions per cycle:
  - superscalar processors
  - VLIW or EPIC processors
16 Superscalar
- The hardware examines the instruction stream for parallel instructions.
- Static scheduling:
  - Issue multiple instructions that are quickly detected to be independent.
  - If one instruction in the set has a dependence, only the prior instructions are issued (this maintains in-order execution).
- Must increase the instruction fetch rate: widen the icache bus and pipeline IF (often by having an integrated instruction fetch unit).
17 Superscalar (Contd.)
- Dynamic scheduling: we use Tomasulo's algorithm.
- Example: issue 2 instructions per cycle; these instructions may even have dependences with each other.
- In general, for an n-issue width, use a combination of two approaches:
  - run the issue step at higher speed, so that e.g. two instructions can be issued in one cycle;
  - build wider logic, so that multiple instructions can be issued at the same time and all dependences are checked (see the sketch below).
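A toy fragment of the "wider logic" approach, reusing the hypothetical issue() from the earlier sketch:

```python
def issue_packet(instr_queue, rob, stations, reg_status, regs, width=2):
    """Try to issue up to `width` instructions in one cycle, oldest first."""
    issued = 0
    while issued < width and instr_queue:
        if not issue(instr_queue[0], rob, stations, reg_status, regs):
            break                 # no free RS/ROB entry: this and younger instructions stall
        instr_queue.pop(0)
        issued += 1
    return issued
```

Because issue() updates the register-status table before the next slot is examined, the second instruction of a packet can issue even when it depends on the first: it simply picks up the first one's ROB tag, matching the slide's point that co-issued instructions may have dependences with each other.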
18 VLIW: Very Long Instruction Word
- The compiler puts parallel instructions into a VLIW packet; Intel modified the idea a little and called it EPIC.
- The compiler has the responsibility to package multiple independent instructions together into a VLIW.
- Can issue many instructions in parallel: past machines have varied from 2 to 32 instructions.
- Simplifies the issuing hardware.
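As a toy illustration of the compiler's packaging job (a greedy sketch, not the scheduling algorithms named on the next slide), instructions are modeled as (dests, srcs) register sets and bundled until a dependence or the packet width forces a new word:

```python
def pack_vliw(instrs, width=4):
    """Greedy packing: an op joins the current packet only if it neither reads nor
    writes a register written by an op already in that packet (no RAW/WAW inside)."""
    packets, current, written = [], [], set()
    for dests, srcs in instrs:
        if (dests | srcs) & written or len(current) == width:
            packets.append(current)          # dependence or packet full: start a new word
            current, written = [], set()
        current.append((dests, srcs))
        written |= dests
    if current:
        packets.append(current)
    return packets

# pack_vliw([({'r1'}, {'r2'}), ({'r4'}, {'r5'}), ({'r6'}, {'r1'})])
# -> the first two independent ops share a packet; the third, which reads r1, starts a new one
```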
19 VLIW (Contd.)
- Performance is limited by the amount of ILP the compiler can discover.
- With many functional units to feed, this requires very complicated compiler scheduling support:
  - loop unrolling
  - software pipelining
  - trace scheduling
  - superblock scheduling
20

Common name               | Issue structure  | Hazard detection   | Scheduling               | Distinguishing characteristic                                       | Examples
Superscalar (static)      | dynamic          | hardware           | static                   | in-order execution                                                  | mostly in the embedded space: MIPS and ARM
Superscalar (dynamic)     | dynamic          | hardware           | dynamic                  | some out-of-order execution, but no speculation                     | none at the present
Superscalar (speculative) | dynamic          | hardware           | dynamic with speculation | out-of-order execution with speculation                             | Pentium 4, MIPS R12K, IBM Power5
VLIW/LIW                  | static           | primarily software | static                   | all hazards determined and indicated by compiler (often implicitly) | most examples are in the embedded space, such as the TI C6x
EPIC                      | primarily static | primarily software | mostly static            | all hazards determined and indicated explicitly by the compiler     | Itanium
21 Example of a 2-Issue Dynamically Scheduled Processor

Loop:  LD     R2, 0(R1)
       DADDIU R2, R2, #1
       SD     R2, 0(R1)
       DADDIU R1, R1, #8
       BNE    R2, R3, Loop

- Assume separate FUs for: effective address calculation, ALU ops, and branch condition evaluation.
- Assume 2 instructions of any type can commit per cycle.
- Assume 2 CDBs.
22 Workout Table

Instruction          | Issue | Exec begin-end | Mem Access | CDB write
LD    R2, 0(R1)      |       |                |            |
DADDI R2, R2, #1     |       |                |            |
SD    R2, 0(R1)      |       |                |            |
DADDI R1, R1, #8     |       |                |            |
BNE   R2, R3, Loop   |       |                |            |
LD    R2, 0(R1)      |       |                |            |
DADDI R2, R2, #1     |       |                |            |
SD    R2, 0(R1)      |       |                |            |
DADDI R1, R1, #8     |       |                |            |
BNE   R2, R3, Loop   |       |                |            |
LD    R2, 0(R1)      |       |                |            |
DADDI R2, R2, #1     |       |                |            |
SD    R2, 0(R1)      |       |                |            |
DADDI R1, R1, #8     |       |                |            |
BNE   R2, R3, Loop   |       |                |            |
23 Workout Table (repeat of the table above; the filled-in values were not preserved in the transcription)
24 Workout Table (with the issue cycles filled in)

Instruction          | Issue | Exec begin-end | Mem Rd | CDB write | Commit/Mem Write
LD    R2, 0(R1)      | 1     |                |        |           |
DADDI R2, R2, #1     | 1     |                |        |           |
SD    R2, 0(R1)      | 2     |                |        |           |
DADDI R1, R1, #8     | 2     |                |        |           |
BNE   R2, R3, Loop   | 3     |                |        |           |
LD    R2, 0(R1)      | 3     |                |        |           |
DADDI R2, R2, #1     | 4     |                |        |           |
SD    R2, 0(R1)      | 4     |                |        |           |
DADDI R1, R1, #8     | 5     |                |        |           |
BNE   R2, R3, Loop   | 5     |                |        |           |
LD    R2, 0(R1)      | 6     |                |        |           |
DADDI R2, R2, #1     | 6     |                |        |           |
SD    R2, 0(R1)      | 7     |                |        |           |
DADDI R1, R1, #8     | 7     |                |        |           |
BNE   R2, R3, Loop   | 8     |                |        |           |
25 Workout Table (repeat; the remaining filled-in values were not preserved in the transcription)
26 Register Renaming
- Key: think of architectural registers as names, not locations. We can have more locations than names, and we dynamically map names to locations.
- A map table contains the current mappings [name -> location].
- Register write: allocate a location and record it in the map table.
- Register read: find the location of the last write to that name in the map table.
- Minor detail: de-allocate locations carefully.

Copyright 2012 by Muzahid, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
27 An Example
- names: r1, r2, r3; locations: l1, l2, l3, l4, l5, l6, l7
- original mapping: r1 -> l1, r2 -> l2, r3 -> l3 (l4-l7 free)

raw instructions | map table (r1 r2 r3) | free locations | renamed instructions
(initial state)  | l1 l2 l3             | l4,l5,l6,l7    |
add r1,r2,r3     | l4 l2 l3             | l5,l6,l7       | add l4,l2,l3
sub r3,r2,r1     | l4 l2 l5             | l6,l7          | sub l5,l2,l4
mul r1,r2,r3     | l6 l2 l5             | l7             | mul l6,l2,l5
div r2,r1,r3     | l6 l7 l5             | (none)         | div l7,l6,l5

Copyright 2012 by Muzahid, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
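The table maps directly onto a few lines of illustrative Python (destination-first operand order, as in the slide; de-allocation is omitted, since the slide flags it as the part that needs care):

```python
def rename(instrs, map_table, free):
    """Rename (op, dest, src1, src2) instructions via a map table and a free list."""
    renamed = []
    for op, dest, s1, s2 in instrs:
        v1, v2 = map_table[s1], map_table[s2]  # reads use the current mappings
        loc = free.pop(0)                      # a write allocates a fresh location...
        map_table[dest] = loc                  # ...and records it in the map table
        renamed.append((op, loc, v1, v2))
    return renamed

code = [('add', 'r1', 'r2', 'r3'), ('sub', 'r3', 'r2', 'r1'),
        ('mul', 'r1', 'r2', 'r3'), ('div', 'r2', 'r1', 'r3')]
print(rename(code, {'r1': 'l1', 'r2': 'l2', 'r3': 'l3'}, ['l4', 'l5', 'l6', 'l7']))
# [('add','l4','l2','l3'), ('sub','l5','l2','l4'),
#  ('mul','l6','l2','l5'), ('div','l7','l6','l5')]
```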
28 Considerations for Speculative Machines
1. ROB vs. register renaming:
- Using a ROB: the architectural registers are contained in the combination of the register set, the reservation stations, and the ROB. If we do not issue instructions for a while, then as instructions commit, the register values will appear in the register file.
- Explicit set of physical registers + register renaming: the physical registers hold both the architecturally visible registers and the temporary values. The extended registers play the role of both the reservation stations and the ROB. WAW and WAR hazards are eliminated by renaming the destination register.

Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013
29 Considerations for Speculative Machines (Contd.)
2. How much to speculate:
- Advantage of speculation: we can execute past branches.
- Disadvantage: we may execute a lot of useless instructions.
- Consequently, execute in speculative mode only through low-cost events:
  - misses in the L1 cache: yes;
  - TLB misses: no. In this case, wait until the instruction causing the event becomes non-speculative.

Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013
30 Considerations for Speculative Machines (Contd.)
3. Speculating through multiple branches:
- Some programs have a high frequency of branches or branch clustering -> speculate across multiple branches.
- This complicates the process of speculation recovery.
- It is hard to speculate on more than one branch per cycle.

Copyright J. Torrellas 1999, 2001, 2002, 2007, 2013