TDT 4260 lecture 7 spring semester 2015

1 TDT 4260 lecture 7, spring semester 2015
Lasse Natvig, The CARD group
Dept. of Computer & Information Science, NTNU

2 Lecture overview
- Repetition
  - Superscalar processor (out-of-order)
  - Dependencies/forwarding
- Today
  - Instruction fetch optimization
  - Thread level parallelism (TLP)
  - Multithreading
  - CMP design space exploration (if time)
  - Scientific paper

3 Repetition

4 Repetition
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB (Ifetch, Reg, ALU, DMem, Reg), showing the following instructions in program order]
    dadd r1,r2,r3
    dsub r4,r1,r3
    and  r6,r1,r7
    or   r8,r1,r9
    xor  r10,r1,r11

5 Compiler techniques for ILP
- For a given pipeline and degree of superscalarity
  - How can these be best utilized?
  - Minimize the number of stalls from hazards
- What can be done by the compiler?
  - Has ages to spend, but less knowledge than the hardware has at run time
  - Static scheduling, but what else?

6 Extend the pipeline with functional units for FP
- Figure C.33, page C.52
- Note the loops

7 Example
Source code:
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
Notice:
- Loop is parallel
  - No dependencies between iterations
- High loop overhead
  - Loop unrolling
MIPS:
    Loop: L.D    F0,0(R1)     ; F0 = x[i]
          ADD.D  F4,F0,F2     ; F2 = s
          S.D    F4,0(R1)     ; Store x[i] + s
          DADDUI R1,R1,#-8    ; x[i] is 8 bytes
          BNE    R1,R2,Loop   ; branch if R1 != R2

8 Pipeline Stalls
    Loop: L.D    F0,0(R1)
          stall
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          stall               (assume integer load latency is 1)
          BNE    R1,R2,Loop
Figure 3.2

9 Static scheduling
Before scheduling:
    Loop: L.D    F0,0(R1)
          stall
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          stall
          BNE    R1,R2,Loop
After scheduling:
    Loop: L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,8(R1)
          BNE    R1,R2,Loop
Result: From 9 cycles per iteration to 7 (delays from the table in Figure 3.2)

10 Loop unrolling
Original loop:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop
Unrolled loop:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
- Reduced loop overhead
- Requires number of iterations divisible by n (here n=4)
- Register renaming
- Offsets have changed
- Stalls not shown

11 Scheduling the unrolled loop
Unrolled loop:
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop
Unrolled and scheduled loop:
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)   ; 16 - 32 = -16 relative to the old R1
          S.D    F16,8(R1)    ; 8 - 32 = -24 relative to the old R1
          BNE    R1,R2,Loop
Avoids stall after: L.D(1), ADD.D(2), DADDUI(1)

12 Loop unrolling example: Summary
- Original code: 9 cycles per element
- Scheduling: 7 cycles per element
- Loop unrolling: 6.75 cycles per element
  - Unrolled 4 iterations
- Combination: 3.5 cycles per element
  - Avoids stalls entirely
- Compiler reduced execution time by 61%
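The cycle counts in the summary can be re-derived by counting instructions and stalls per pass through the loop body. The following is a minimal Python sketch, assuming the stall counts used on the preceding slides (1 after L.D feeding ADD.D, 2 after ADD.D feeding S.D, 1 after DADDUI feeding BNE). The function name is illustrative and not part of the lecture.

```python
def cycles_per_element(instructions, stalls, elements):
    """Cycles per array element for one pass through the loop body."""
    return (instructions + stalls) / elements

# Original loop: 5 instructions, 1 + 2 + 1 = 4 stalls, 1 element per pass.
original = cycles_per_element(5, 4, 1)
# Scheduled loop: only the 2 stalls after ADD.D remain.
scheduled = cycles_per_element(5, 2, 1)
# Unrolled 4x, not scheduled: 14 instructions, 4 * (1 + 2) + 1 = 13 stalls.
unrolled = cycles_per_element(14, 13, 4)
# Unrolled 4x and scheduled: 14 instructions, no stalls.
combined = cycles_per_element(14, 0, 4)

# Relative reduction in execution time from original to combined.
reduction = 1 - combined / original
print(original, scheduled, unrolled, combined, round(reduction * 100))
```

The last printed value is the percentage reduction, 1 - 3.5/9, rounded to whole percent.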

13 Loop unrolling in practice
- We usually do not know the upper bound of the loop
- Suppose it is n, and we would like to unroll the loop to make k copies of the body
- Instead of a single unrolled loop, we generate a pair of consecutive loops:
  - 1st executes (n mod k) times and has a body that is the original loop
  - 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
- For large values of n, most of the execution time will be spent in the unrolled loop
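The pair of consecutive loops can be sketched as follows. This is only an illustration of the structure a compiler would emit, written in Python for readability, with k fixed to 4 and the loop body taken from the x[i] = x[i] + s example. The names K and add_scalar_unrolled are assumptions made for this sketch.

```python
K = 4  # unroll factor, fixed here because the body copies are written out by hand

def add_scalar_unrolled(x, s):
    """Add s to every element of x using a remainder loop and a K-way unrolled loop."""
    n = len(x)
    i = 0
    # 1st loop: the original body, executed (n mod K) times.
    for _ in range(n % K):
        x[i] += s
        i += 1
    # 2nd loop: K copies of the body inside an outer loop that runs n // K times.
    for _ in range(n // K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += K
    return x

print(add_scalar_unrolled([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 1))
```

With 10 elements the first loop handles 2 of them and the unrolled loop handles the remaining 8 in 2 iterations.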

14 Loop Unrolling - Disadvantages
- Growth in code size
  - For larger loops, it can increase the instruction cache miss rate
- Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
  - If it is not possible to allocate all live values to registers, unrolling may lose some or all of its advantage

15 Getting CPI below 1
- CPI ≥ 1 if only 1 instruction is issued every clock cycle
- Multiple-issue processors come in 3 flavors:
  1. Statically-scheduled superscalar processors
     - In-order execution
     - Varying number of instructions issued (compiler)
  2. Dynamically-scheduled superscalar processors
     - Out-of-order execution (OOO or O-O-O)
     - Varying number of instructions issued (CPU)
  3. VLIW (very long instruction word) processors
     - In-order execution
     - Fixed number of instructions issued

16 VLIW

17 VLIW: Very Long Instruction Word (1/2)
- Each VLIW has explicit coding for multiple operations
  - Several instructions combined into packets
  - Trade-off: instruction space for simple decoding
- Room for many operations
- Independent operations => execute in parallel
- E.g.: 2 FP ops, 1 integer op, 1 branch, 1 memory ref

18 VLIW: Very Long Instruction Word (2/2)
- Important to avoid empty instruction slots
  - Loop unrolling
  - Local scheduling
  - Global scheduling
    - Scheduling across branches
- In the following example we assume a VLIW ISA with
  - 2 load/store
  - 2 fp
  - 1 int/branch

19 Recall: Unrolled loop that minimizes stalls for scalar
Source code:
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
Register mapping: s -> F2, i -> R1
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)   ; 16 - 32 = -16 relative to the old R1
          S.D    F16,8(R1)    ; 8 - 32 = -24 relative to the old R1
          BNE    R1,R2,Loop

20 Loop Unrolling in VLIW
Clock | Memory reference 1 | Memory reference 2 | FP operation 1   | FP operation 2   | Int. op/branch
1     | L.D F0,0(R1)       | L.D F6,-8(R1)      |                  |                  |
2     | L.D F10,-16(R1)    | L.D F14,-24(R1)    |                  |                  |
3     | L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |
4     | L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |
5     |                    |                    | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |
6     | S.D F4,0(R1)       | S.D F8,-8(R1)      | ADD.D F28,F26,F2 |                  |
7     | S.D F12,-16(R1)    | S.D F16,-24(R1)    |                  |                  |
8     | S.D F20,-32(R1)    | S.D F24,-40(R1)    |                  |                  | DSUBUI R1,R1,#56
9     | S.D F28,8(R1)      |                    |                  |                  | BNEZ R1,LOOP
- Unrolled 7 iterations to avoid delays (7 elements of 8 bytes, so R1 is decremented by 56)
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: Need more registers in VLIW (15 vs. 6 in superscalar)
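The utilization figures on this slide follow from counting the operations placed in each of the 9 clock cycles. A minimal Python sketch, assuming the 5 issue slots per instruction word stated earlier (2 load/store, 2 FP, 1 int/branch). The per-cycle counts are read off the schedule, and the variable names are illustrative.

```python
SLOTS_PER_WORD = 5   # 2 memory references + 2 FP operations + 1 integer/branch
ITERATIONS = 7       # loop iterations covered by one pass through the schedule

# Operations issued in clock cycles 1..9 of the schedule.
ops_per_cycle = [2, 2, 4, 3, 2, 3, 2, 3, 2]

clocks = len(ops_per_cycle)
total_ops = sum(ops_per_cycle)

clocks_per_iteration = clocks / ITERATIONS
ops_per_clock = total_ops / clocks
efficiency = total_ops / (clocks * SLOTS_PER_WORD)

print(total_ops, round(clocks_per_iteration, 2),
      round(ops_per_clock, 2), round(efficiency, 2))
```

The exact values are 23/9 operations per clock and 23/45 slot utilization, which the slide quotes as roughly 2.5 and 50%.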

21 Problems with 1st Generation VLIW
- Increase in code size
  - Loop unrolling
  - Partially empty VLIW instruction slots
- Operated in lock-step; no hazard detection HW
  - A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
  - Compiler might predict functional unit latencies, but caches are hard to predict
- Reduced binary code compatibility
  - Strict VLIW => different numbers of functional units and unit latencies require different versions of the code

22 INSTRUCTION FETCHING

23 Instruction fetching
- Want to issue >1 instruction every cycle
- Several problems
  - Bandwidth / Latency
  - Determining which instructions
    - Jumps
    - Branches
- Integrated instruction fetch unit

24 Branch Target Buffer (BTB)
- Purpose: reduce the branch miss penalty
- The PC is checked against the entries in the first column, which stores the addresses of known branches
- If a match is found there, the instruction is a taken branch, and the predicted-PC field contains the prediction of the next PC after the branch
- Fetching can begin immediately at that address
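The lookup described on this slide can be sketched in a few lines. This is a simplified Python model, not a hardware description. It assumes an unbounded table (a real BTB has a fixed number of entries), 4-byte instructions, and that only taken branches are kept in the buffer. The class and method names are made up for the sketch.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a known taken branch to its predicted next PC."""

    def __init__(self):
        self.entries = {}  # branch instruction address -> predicted PC

    def predict(self, pc, instr_size=4):
        # Hit: predict a taken branch and fetch from the stored target.
        # Miss: assume no taken branch and fetch the next sequential instruction.
        return self.entries.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        # Called when the branch resolves: keep only taken branches in the buffer.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first = btb.predict(0x100)        # miss: falls through to 0x104
btb.update(0x100, True, 0x40)     # branch at 0x100 was taken to 0x40
second = btb.predict(0x100)       # hit: predicts 0x40
print(hex(first), hex(second))
```

Because the prediction is available from the fetch address alone, the fetch unit does not have to wait for the instruction to be decoded before redirecting.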

25 Return Address Predictor
- Small buffer of return addresses acts as a stack
- Caches most recent return addresses
- Call: push a return address on stack
- Return: pop an address off stack & predict as new PC
- Zero buffer entries implies that the standard branch prediction is used
[Figure: misprediction frequency (0% to 70%) versus number of return address buffer entries, for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, vortex]
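A return address predictor of this kind behaves like a small bounded stack. The Python sketch below is one possible model under stated assumptions: when the buffer is full the oldest entry is discarded, and an empty buffer returns None to signal that the standard branch predictor should be used instead. The class name, default size and overflow policy are assumptions, not taken from the slide.

```python
class ReturnAddressStack:
    """Toy return address predictor with a fixed number of entries."""

    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def on_call(self, return_address):
        # Call: push the return address; with zero entries nothing is stored.
        if self.size == 0:
            return
        if len(self.stack) == self.size:
            self.stack.pop(0)  # assumed policy: overwrite the oldest entry
        self.stack.append(return_address)

    def on_return(self):
        # Return: pop the most recent address and use it as the predicted PC.
        # None means "no prediction here, fall back to the standard predictor".
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(size=2)
ras.on_call(0x104)   # outer call
ras.on_call(0x208)   # nested call
inner = ras.on_return()
outer = ras.on_return()
print(hex(inner), hex(outer), ras.on_return())
```

Nested calls return in reverse order, which is why a stack predicts them well while a single BTB entry per return instruction does not.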

26 Integrated Instruction Fetch (IF) Units
- Recent designs have implemented the fetch stage as a separate, autonomous unit
  - Multiple-issue IF in one simple pipeline stage is too complex
- An integrated fetch unit provides:
  - Branch prediction
  - Instruction prefetch
  - Instruction memory access and buffering

27 Limits to ILP
- Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
- However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future
- How much ILP is available using existing mechanisms with increasing HW budgets?

28 Ideal HW Model
1. Register renaming: infinite virtual registers; all register WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted
   - 2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses known & a load can be moved before a store provided addresses not equal
   - 1 & 4 eliminate all but the true data dependencies
5. Perfect caches; 1 cycle latency for all instructions; unlimited instructions issued/clock cycle

29 Upper Limit to ILP: Ideal Machine (Figure 3.26)
[Figure: instructions per clock for the SPEC92 benchmark programs gcc, espresso, li (integer) and fpppp, doduc, tomcatv (FP)]

30 Instruction window
- Ideal HW needs to know the entire code
- Obviously not practical
  - The number of register dependence checks scales quadratically
- Window: the set of instructions examined for simultaneous execution
- How does the size of the window affect IPC?
  - Too small window: can't see whole loops
  - Too large window: hard to implement
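The quadratic growth can be made concrete by counting comparisons. The Python sketch below uses a simple model that is an assumption of this example, not something stated on the slide: every instruction has two register source operands, and each source is compared against the destination of every earlier instruction in the window.

```python
def dependence_comparisons(window_size, sources_per_instr=2):
    """Register comparisons needed to find all dependences inside one window."""
    # Instruction number i (0-based) has i earlier instructions to check against.
    return sources_per_instr * sum(range(window_size))

for n in (32, 64, 2000):
    print(n, dependence_comparisons(n))
```

Under this model the count is n(n - 1) for a window of n instructions, so doubling the window roughly quadruples the comparison hardware.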

31 More Realistic HW: Window Impact
- Figure 3.27
- Assume up to 64 instructions can be issued in one cycle

32 TLP

33 Thread Level Parallelism (TLP)
- ILP exploits implicit parallel operations within a loop or straight-line code segment
- TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
- Use multiple instruction streams to improve:
  1. Throughput of computers that run many programs
  2. Execution time of a single application implemented as a multi-threaded program (parallel program)

34 Multi-threaded execution
- Multi-threading: multiple threads share the functional units of 1 processor via overlapping
  - Must duplicate the independent state of each thread, e.g., a separate register file, PC and page table
  - Memory shared through virtual memory mechanisms
  - HW for fast thread switch; much faster than a full process switch (context switch), which takes 100s to 1000s of clocks
- When to switch? Fine grained vs. coarse grained

35 Fine-Grained Multithreading
- Switches between threads on each instruction
  - Multiple threads interleaved
- Usually round-robin fashion, skipping stalled threads
- CPU must be able to switch threads every clock
- Hides both short and long stalls
- But slows down execution of individual threads
  - A thread ready to execute without stalls will be delayed by instructions from other threads
- Used on Sun's Niagara (T1)
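The round-robin selection with skipping of stalled threads can be sketched as a small simulation. This Python model is a simplification under stated assumptions: one instruction is issued per cycle, stalls are given up front as sets of cycle numbers per thread, and a cycle in which every thread is stalled issues nothing. The function name and the input format are invented for the sketch.

```python
def fine_grained_schedule(stalled, cycles):
    """Return, for each cycle, the thread that issues (or None for an idle cycle).

    stalled[t] is the set of cycles in which thread t cannot issue.
    """
    n = len(stalled)
    schedule = []
    next_thread = 0  # thread that is first in line for the coming cycle
    for cycle in range(cycles):
        chosen = None
        # Walk the threads in round-robin order, skipping stalled ones.
        for offset in range(n):
            t = (next_thread + offset) % n
            if cycle not in stalled[t]:
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n
    return schedule


# Three threads; thread 1 is stalled in cycles 1 and 2.
result = fine_grained_schedule([set(), {1, 2}, set()], cycles=6)
print(result)
```

Thread 1's stall is hidden by the other two threads, but thread 0 still issues only every other cycle or less, which is the single-thread slowdown mentioned on the slide.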

36 Coarse-Grained Multithreading
- Switch threads only on costly stalls (e.g., L2 cache miss)
- Advantages
  - No need for very fast thread-switching
  - Doesn't slow down the thread, since it switches only when the thread encounters a costly stall
- Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  - Since the CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
  - New thread must fill the pipeline before instructions can complete

37 Do both ILP and TLP?
- TLP and ILP exploit two different kinds of parallel structure in a program
- Can a high-ILP processor also exploit TLP?
  - Functional units are often idle because of stalls or dependences in the code
- Can TLP be a source of independent instructions that might reduce processor stalls?
- Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
- => Simultaneous Multi-threading (SMT)
  - Intel: Hyper-Threading

38 Simultaneous Multi-threading
[Figure: issue slots per cycle for 8 functional units (M, M, FX, FX, FP, FP, BR, CC), shown for one thread and for two threads]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

39 Simultaneous Multi-threading (SMT)
- A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  - Large set of virtual registers
    - Virtual = not all visible at ISA level
    - Register renaming (Tomasulo, course TDT4255)
  - Dynamic scheduling
- Just add a per-thread renaming table and keep separate PCs
  - Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

40 Multi-threaded categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]

41 Design Challenges in SMT
- SMT makes sense only with fine-grained implementation
  - How to reduce the impact on single thread performance?
  - Give priority to one or a few preferred threads
- Large register file needed to hold multiple contexts
- Not affecting clock cycle time, especially in
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
- Threads should be good neighbours
  - QoS and fairness

42 UltraSPARC T1 ("Niagara")
- Target: Commercial server applications
  - High thread level parallelism (TLP)
    - Large numbers of parallel client requests
  - Low instruction level parallelism (ILP)
    - High cache miss rates
    - Many unpredictable branches
- Approach: Multicore, fine-grain multithreading, simple pipeline, small L1 caches, shared L2

43 T1 processor logical overview
- 1.2 GHz at 72W typical, 79W peak power consumption

44 T1 pipeline / 4 threads
- Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
  - S = switch/select
- Shared units: L1 cache, L2 cache, TLB, exec. units, pipe registers
- Separate units: PC, instruction buffer, reg file, store buffer
- Note: thread select logic & thread state storage

45 UltraSPARC T1 and T2 available as open-source hardware

46 T1 Multithreading Unicore Performance (fig 3.31)

47 Not Ready Breakdown (fig 3.32)
[Figure: fraction of cycles not ready (0% to 100%) for TPC-C, SPECJBB, SPECWeb99, broken down into Other, Pipeline delay, L2 miss, L1 D miss, L1 I miss]
- Cache effects responsible for 50% - 75% of waiting
- SPECJBB has a higher amount of pipeline delay because it has a higher branch frequency

48 CPI Breakdown of Performance (Figure 3.33)
[Table: columns Benchmark, Per-thread CPI, Per-core CPI, Effective CPI for 8 cores, Effective IPC for 8 cores; rows TPC-C, SPECJBB, SPECWeb]
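The four columns of this breakdown are related by simple arithmetic if one assumes that the 4 threads of a core and the 8 cores contribute equally. The Python sketch below shows that relation only. The idealized division is an assumption of this example, and the input value 8.0 is a hypothetical per-thread CPI chosen for round numbers, not a measurement from the figure.

```python
def t1_cpi_breakdown(per_thread_cpi, threads_per_core=4, cores=8):
    """Derive per-core CPI, effective CPI and effective IPC from a per-thread CPI."""
    per_core_cpi = per_thread_cpi / threads_per_core   # threads interleave on one core
    effective_cpi = per_core_cpi / cores                # cores run in parallel
    effective_ipc = 1 / effective_cpi
    return per_core_cpi, effective_cpi, effective_ipc

# Hypothetical input, for illustration only.
per_core, eff_cpi, eff_ipc = t1_cpi_breakdown(8.0)
print(per_core, eff_cpi, eff_ipc)
```

The point of the table is that a high per-thread CPI can still give an effective IPC well above 1 once threads and cores are multiplied in.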
