

1 Review and Advanced Concepts
[Adapted from instructor's supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]

2 Pipelining Review
[Figure: five-stage pipeline datapath showing the PC, the stages IF, ID, EX, M, W, and the pipeline registers IF/ID, ID/EX, EX/M, M/WB]

3
  I0: add R4 = R1 + R0
  I1: sub R9 = R3 - R4
  I2: add R4 = R5 + R6
  I3: lw  R2, 100(R3)
  I4: lw  R2, 0(R2)
  I5: sw  R2, 100(R4)
  I6: and R2 = R2 & R1
  I7: beq R9 == R1, TARGET
  I8: and R9 = R9 & R1
[Figure: pipeline datapath (PC, IF, IF/ID, ID, ID/EX, EX, EX/M, M, M/WB, W) with a timing diagram of I0 to I8 passing through IF ID EX M W over cycles c0 to c13]
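The read-after-write (RAW) dependences in the sequence above can be listed mechanically. The following is a minimal sketch, not part of the original slides: the tuple encoding of the instructions, the function names, and the two-instruction window (the distance over which a classic five-stage pipeline needs forwarding) are assumptions made for illustration. Whether the I4-to-I5 pair actually costs a stall depends on the forwarding paths provided for store data, which the slide does not specify.

```python
# Each entry: (label, opcode, destination register or None, source registers).
# This encoding of the slide's nine instructions is an assumption of the sketch.
PROGRAM = [
    ("I0", "add", "R4", ("R1", "R0")),
    ("I1", "sub", "R9", ("R3", "R4")),
    ("I2", "add", "R4", ("R5", "R6")),
    ("I3", "lw",  "R2", ("R3",)),
    ("I4", "lw",  "R2", ("R2",)),
    ("I5", "sw",  None, ("R2", "R4")),
    ("I6", "and", "R2", ("R2", "R1")),
    ("I7", "beq", None, ("R9", "R1")),
    ("I8", "and", "R9", ("R9", "R1")),
]


def raw_dependences(program, window=2):
    """Return (producer, consumer, register, distance) for RAW pairs
    whose producer is at most `window` instructions before the consumer."""
    found = []
    last_writer = {}  # register -> index of the most recent writer
    for i, (label, _op, dest, sources) in enumerate(program):
        for reg in dict.fromkeys(sources):  # de-duplicate, keep order
            j = last_writer.get(reg)
            if j is not None and i - j <= window:
                found.append((program[j][0], label, reg, i - j))
        if dest is not None:
            last_writer[dest] = i
    return found


def load_use_pairs(program):
    """RAW pairs at distance 1 whose producer is a load (stall candidates)."""
    is_load = {label: op == "lw" for label, op, _dest, _src in program}
    return [(p, c) for p, c, _reg, dist in raw_dependences(program)
            if dist == 1 and is_load[p]]


if __name__ == "__main__":
    for dep in raw_dependences(PROGRAM):
        print(dep)
    print("load-use candidates:", load_use_pairs(PROGRAM))
```

Only the nearest earlier writer of each register is considered, so the write to R4 by I0 is correctly hidden from I5 by the later write in I2.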

4
[Figure: two pipeline timing diagrams for I0 to I8. The first repeats the diagram from the previous slide; the second runs over cycles C0 to C18 and shows a stall cycle (X) in I4, I5, and I6]

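5 Memory Hierarchy Review
[Figure: memory access path. The virtual address (VA) goes to the TLB lookup; a hit yields the physical address (PA) for the cache, a miss goes to translation; a cache hit returns data, a cache miss goes to main memory]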

6 4 Questions for the Memory Hierarchy
  - Q1: Where can an entry be placed in the upper level? (Entry placement)
  - Q2: How is an entry found if it is in the upper level? (Entry identification)
  - Q3: Which entry should be replaced on a miss? (Entry replacement)
  - Q4: What happens on a write? (Write strategy)

7 Q1&Q2: Where can an entry be placed/found?

  Scheme              # of sets                        Entries per set
  Direct mapped       # of entries                     1
  Set associative     (# of entries) / associativity   Associativity (typically 2 to 16)
  Fully associative   1                                # of entries

  Scheme              Location method                      # of comparisons
  Direct mapped       Index                                1
  Set associative     Index the set; compare set's tags    Degree of associativity
  Fully associative   Compare all entries' tags            # of entries
                      Separate lookup (page) table         0
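As a rough illustration of the first table, the sketch below computes the number of sets for a given associativity and splits a block address into tag, index, and offset. The function names and the example sizes are assumptions chosen for illustration; real caches use power-of-two sizes and bit slicing rather than division.

```python
def cache_geometry(num_entries, associativity):
    """Return (number of sets, entries per set) for a cache organisation."""
    if num_entries % associativity != 0:
        raise ValueError("associativity must divide the number of entries")
    return num_entries // associativity, associativity


def split_address(addr, num_sets, block_bytes):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = addr % block_bytes
    block_number = addr // block_bytes
    index = block_number % num_sets   # selects the set to search
    tag = block_number // num_sets    # compared against the tags in that set
    return tag, index, offset


if __name__ == "__main__":
    # An 8-entry cache organised three ways.
    print(cache_geometry(8, 1))   # direct mapped
    print(cache_geometry(8, 2))   # 2-way set associative
    print(cache_geometry(8, 8))   # fully associative
    print(split_address(0x1234, num_sets=4, block_bytes=16))
```

With one set (fully associative) the index is always 0 and the whole block number becomes the tag, which is why every entry's tag must be compared.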

8 Q3: Which entry should be replaced on a miss?
  - Easy for direct mapped: only one choice
  - Set associative or fully associative
    - Random
    - LRU (Least Recently Used)
  - For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU
  - LRU is too costly to implement for high levels of associativity (> 4-way) since tracking the usage information is costly
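A minimal sketch of LRU replacement within a single set follows. The class name `LRUSet` and the use of Python's `OrderedDict` to record recency are illustrative assumptions; hardware tracks usage with a few bits per set, which is exactly the cost the slide refers to.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (tags only, no data)."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # least recently used tag first

    def access(self, tag):
        """Return True on a hit, False on a miss (which fills or replaces)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)  # evict the least recently used tag
        self.tags[tag] = None
        return False


if __name__ == "__main__":
    two_way = LRUSet(ways=2)
    for tag in ["A", "B", "A", "C", "B", "A"]:
        print(tag, "hit" if two_way.access(tag) else "miss")
```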

9 Q4: What happens on a write?
  - Write through: the information is written to the entry in the current memory level and to the entry in the next level of the memory hierarchy
    - Always combined with a write buffer so that waits for writes to next-level memory can be eliminated (as long as the write buffer doesn't fill)
  - Write back: the information is written only to the entry in the current memory level. The modified entry is written to the next level of memory only when it is replaced.
    - Need a dirty bit to keep track of whether the entry is clean or dirty
    - Virtual memory systems always use write back of dirty pages to disk
  - Pros and cons of each?
    - Write through: read misses don't result in writes (so are simpler and cheaper); easier to implement
    - Write back: writes run at the speed of the cache; repeated writes require only one write to lower level
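The claim that repeated writes need only one lower-level write under write back can be checked with a small count. This is a sketch under stated assumptions: a direct-mapped cache, write allocation, a trace containing only writes, and a final flush of dirty entries. The function name and parameters are hypothetical.

```python
def lower_level_writes(write_blocks, policy, num_entries=4):
    """Count writes that reach the next memory level.

    write_blocks: block numbers written by the processor, in order.
    policy: "write-through" or "write-back".
    """
    if policy == "write-through":
        return len(write_blocks)      # every write also goes to the next level
    if policy != "write-back":
        raise ValueError("unknown policy: " + policy)

    resident = {}                     # index -> dirty block held in that entry
    write_backs = 0
    for block in write_blocks:
        index = block % num_entries
        if index in resident and resident[index] != block:
            write_backs += 1          # dirty victim written back on replacement
        resident[index] = block
    return write_backs + len(resident)  # assumed final flush of dirty entries


if __name__ == "__main__":
    trace = [5, 5, 5, 5, 6, 6]
    print(lower_level_writes(trace, "write-through"))
    print(lower_level_writes(trace, "write-back"))
```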

10 Improving Performance
  - Performance (execution time) = Instr Count * CPI * Clock time
  - Average CPI > Ideal CPI
  - Get CPI close to the ideal one (forwarding, branch prediction, multithreading, prefetch)
  - Exploit instruction-level parallelism (ILP) (pipelining, superscalar, out of order)
  - Exploit data-level parallelism (DLP) (vector processors)
  - (Will not discuss multicore or multiprocessors)

11 Instruction Level Parallelism (ILP)
  - Pipelining: executing multiple instructions in parallel
  - To increase ILP
    - Deeper pipeline
      - Less work per stage => shorter clock cycle (reduces the Clock time factor of CPI * Clock time)
    - Multiple issue
      - Replicate pipeline stages => multiple pipelines
      - Start multiple instructions per clock cycle
      - E.g., 4 GHz, 4-way multiple issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
      - But dependencies reduce this in practice
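The peak figures in the 4 GHz, 4-way example follow directly from the issue width and clock rate. The sketch below reproduces that arithmetic and the Instr Count * CPI * Clock time product; the function names and the sample instruction count are assumptions for illustration.

```python
def peak_rate(clock_hz, issue_width):
    """Return (peak IPC, peak CPI, peak instructions per second)."""
    peak_ipc = issue_width
    peak_cpi = 1 / issue_width
    return peak_ipc, peak_cpi, clock_hz * issue_width


def cpu_time(instr_count, cpi, clock_period_s):
    """Execution time = instruction count * CPI * clock cycle time."""
    return instr_count * cpi * clock_period_s


if __name__ == "__main__":
    ipc, cpi, rate = peak_rate(4e9, 4)
    print(ipc, cpi, rate / 1e9, "BIPS")
    # One million instructions at peak CPI on a 4 GHz clock (0.25 ns period).
    print(cpu_time(1_000_000, cpi, 0.25e-9), "seconds")
```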

12 Multiple Issue
  - Static multiple issue
    - Compiler groups instructions to be issued together
    - Packages them into issue slots
    - Compiler detects and avoids hazards
  - Dynamic multiple issue
    - CPU examines instruction stream and chooses instructions to issue each cycle
    - Compiler can help by reordering instructions
    - CPU resolves hazards using advanced techniques at runtime

13 MIPS with Static Dual Issue
  - Two-issue packets
    - One ALU/branch instruction
    - One load/store instruction
    - 64-bit aligned
      - ALU/branch, then load/store
      - Pad an unused instruction with nop

  Address   Instruction type   Pipeline stages
  n         ALU/branch         IF  ID  EX  MEM WB
  n + 4     Load/store         IF  ID  EX  MEM WB
  n + 8     ALU/branch             IF  ID  EX  MEM WB
  n + 12    Load/store             IF  ID  EX  MEM WB
  n + 16    ALU/branch                 IF  ID  EX  MEM WB
  n + 20    Load/store                 IF  ID  EX  MEM WB

14 MIPS with Static Dual Issue

15 Static Multiple Issue
  - Compiler groups instructions into issue packets
    - Group of instructions that can be issued on a single cycle
    - Determined by pipeline resources required
  - Think of an issue packet as a very long instruction
    - Specifies multiple concurrent operations
    - Very Long Instruction Word (VLIW)

16 Scheduling Static Multiple Issue
  - Compiler must remove some/all hazards
    - Reorder instructions into issue packets
    - No dependencies within a packet
    - Possibly some dependencies between packets
      - Varies between ISAs; compiler must know!
    - Pad with nop if necessary

17 Hazards in the Dual Issue MIPS
  - More instructions executing in parallel
  - EX data hazard
    - Forwarding avoided stalls with single issue
    - Now can't use ALU result in load/store in same packet
        add  $t0, $s0, $s1
        load $s2, 0($t0)
    - Split into two packets, effectively a stall
  - Load-use hazard
    - Still one cycle use latency, but now two instructions
  - More aggressive scheduling required

18 Scheduling Example
  - Schedule this for dual-issue MIPS

    Loop: lw   $t0, 0($s1)       # $t0 = array element
          addu $t0, $t0, $s2     # add scalar in $s2
          sw   $t0, 0($s1)       # store result
          addi $s1, $s1, -4      # decrement pointer
          bne  $s1, $zero, Loop  # branch if $s1 != 0

          ALU/branch              Load/store         cycle
    Loop: nop                     lw  $t0, 0($s1)    1
          addi $s1, $s1, -4       nop                2
          addu $t0, $t0, $s2      nop                3
          bne  $s1, $zero, Loop   sw  $t0, 4($s1)    4

  - IPC = 5/4 = 1.25 (cf. peak IPC = 2)
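The schedule above can be checked mechanically: count the non-nop slots to get the IPC, and confirm that no packet contains a read-after-write dependence between its two slots. This is a minimal sketch; the tuple encoding of the packets and the function names are assumptions made for illustration, with `None` standing for a nop.

```python
# Each packet is (ALU/branch slot, load/store slot); None stands for a nop.
# An instruction is (text, destination register or None, source registers).
SCHEDULE = [
    (None,
     ("lw $t0, 0($s1)", "$t0", ("$s1",))),
    (("addi $s1, $s1, -4", "$s1", ("$s1",)),
     None),
    (("addu $t0, $t0, $s2", "$t0", ("$t0", "$s2")),
     None),
    (("bne $s1, $zero, Loop", None, ("$s1", "$zero")),
     ("sw $t0, 4($s1)", None, ("$t0", "$s1"))),
]


def ipc(schedule):
    """Useful (non-nop) instructions issued per cycle."""
    issued = sum(1 for packet in schedule for slot in packet if slot is not None)
    return issued / len(schedule)


def packets_with_internal_raw(schedule):
    """Return the cycles whose two slots have a RAW dependence on each other."""
    bad = []
    for cycle, (alu, mem) in enumerate(schedule, start=1):
        if alu is None or mem is None:
            continue
        alu_feeds_mem = alu[1] is not None and alu[1] in mem[2]
        mem_feeds_alu = mem[1] is not None and mem[1] in alu[2]
        if alu_feeds_mem or mem_feeds_alu:
            bad.append(cycle)
    return bad


if __name__ == "__main__":
    print("IPC =", ipc(SCHEDULE))
    print("packets with internal RAW:", packets_with_internal_raw(SCHEDULE))
```

The checker only looks inside a packet; dependences between packets, such as the load-use distance from cycle 1 to cycle 3, are not examined here.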

19 Loop Unrolling
  - Replicate loop body to expose more parallelism
    - Reduces loop-control overhead
  - Use different registers per replication
    - Called register renaming
    - Avoid loop-carried anti-dependencies
      - Store followed by a load of the same register
      - Aka name dependence: reuse of a register name

20 Loop Unrolling Example

          ALU/branch              Load/store          cycle
    Loop: addi $s1, $s1, -16      lw  $t0, 0($s1)     1
          nop                     lw  $t1, 12($s1)    2
          addu $t0, $t0, $s2      lw  $t2, 8($s1)     3
          addu $t1, $t1, $s2      lw  $t3, 4($s1)     4
          addu $t2, $t2, $s2      sw  $t0, 16($s1)    5
          addu $t3, $t3, $s2      sw  $t1, 12($s1)    6
          nop                     sw  $t2, 8($s1)     7
          bne  $s1, $zero, Loop   sw  $t3, 4($s1)     8

  - IPC = 14/8 = 1.75
    - Closer to 2, but at cost of registers and code size
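The same transformation can be sketched at source level. The two functions below are illustrative assumptions (names and the unroll factor of 4 mirror the slide, nothing more): the unrolled version uses a separate temporary per replication, playing the role of $t0 to $t3, and a cleanup loop handles lengths that are not a multiple of 4.

```python
def add_scalar_rolled(values, scalar):
    """One element per iteration: one loop test per element."""
    out = list(values)
    for i in range(len(out)):
        out[i] = out[i] + scalar
    return out


def add_scalar_unrolled4(values, scalar):
    """Four elements per iteration, each with its own temporary."""
    out = list(values)
    n = len(out)
    i = 0
    while i + 4 <= n:
        # Independent temporaries: no name dependence between replications.
        t0 = out[i] + scalar
        t1 = out[i + 1] + scalar
        t2 = out[i + 2] + scalar
        t3 = out[i + 3] + scalar
        out[i], out[i + 1], out[i + 2], out[i + 3] = t0, t1, t2, t3
        i += 4                      # one loop-control update per four elements
    while i < n:                    # cleanup for leftover elements
        out[i] = out[i] + scalar
        i += 1
    return out


if __name__ == "__main__":
    data = list(range(10))
    print(add_scalar_unrolled4(data, 3) == add_scalar_rolled(data, 3))
```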

21 Speculation
  - Guess what to do with an instruction
    - Start operation as soon as possible
    - Check whether guess was right
      - If so, complete the operation
      - If not, roll back and do the right thing
  - Common to static and dynamic multiple issue
  - Examples
    - Speculate on branch outcome
      - Roll back if path taken is different
    - Speculate on load
      - Roll back if location is updated

22 Compiler/Hardware Speculation
  - Compiler can reorder instructions
    - e.g., move load before branch
    - Can include fix-up instructions to recover from incorrect guess
  - Hardware can look ahead for instructions to execute
    - Buffer results until it determines they are actually needed
    - Flush buffers on incorrect speculation

23 Speculation and Exceptions
  - What if exception occurs on a speculatively executed instruction?
    - e.g., speculative load before null-pointer check
  - Static speculation
    - Can add ISA support for deferring exceptions
  - Dynamic speculation
    - Can buffer exceptions until instruction completion (which may not occur)
[Chapter 4: The Processor]

24 Dynamic Multiple Issue
  - Superscalar processors
  - CPU decides whether to issue 0, 1, 2, ... instructions each cycle
    - Avoiding structural and data hazards
  - Avoids the need for compiler scheduling
    - Though it may still help
    - Code semantics ensured by the CPU

25 Dynamic Pipeline Scheduling
  - Allow the CPU to execute instructions out of order to avoid stalls
    - But commit result to registers in order
  - Example
      lw   $t0, 20($s2)
      addu $t1, $t0, $t2
      sub  $s4, $s4, $t3
      slti $t5, $s4, 20
    - Can start sub while addu is waiting for lw
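The example above can be turned into a small dataflow calculation showing that sub starts before addu. This is a sketch under loudly stated assumptions: one instruction is dispatched per cycle in program order, function units are unlimited, an instruction starts as soon as its source operands are ready, and the 4-cycle lw latency is a hypothetical value standing in for a slow memory access. In-order commit is not modelled.

```python
# (name, destination register, source registers) in program order.
EXAMPLE = [
    ("lw",   "$t0", ("$s2",)),
    ("addu", "$t1", ("$t0", "$t2")),
    ("sub",  "$s4", ("$s4", "$t3")),
    ("slti", "$t5", ("$s4",)),
]

# Assumed latencies in cycles; anything not listed takes 1 cycle.
LATENCY = {"lw": 4}


def start_cycles(program, latency):
    """Cycle in which each instruction begins execution."""
    ready_at = {}  # register -> cycle when its newest value is available
    start = {}
    for dispatch_cycle, (name, dest, sources) in enumerate(program):
        operands_ready = max((ready_at.get(r, 0) for r in sources), default=0)
        start[name] = max(dispatch_cycle, operands_ready)
        ready_at[dest] = start[name] + latency.get(name, 1)
    return start


if __name__ == "__main__":
    for name, cycle in start_cycles(EXAMPLE, LATENCY).items():
        print(name, "starts in cycle", cycle)
```

Because addu must wait for $t0 while sub has no pending producer, sub (and then slti) begin execution earlier than addu even though they come later in program order.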

26 Dynamically Scheduled CPU
[Figure: block diagram of a dynamically scheduled CPU, with these annotations]
  - Preserves dependencies
  - Hold pending operands
  - Results also sent to any waiting reservation stations
  - Reorder buffer for register writes
  - Can supply operands for issued instructions

27 Register Renaming
  - Reservation stations and reorder buffer effectively provide register renaming
  - On instruction issue to reservation station
    - If operand is available in register file or reorder buffer
      - Copied to reservation station
      - No longer required in the register; can be overwritten
    - If operand is not yet available
      - It will be provided to the reservation station by a function unit
      - Register update may not be required

28 Why Do Dynamic Scheduling?
  - Why not just let the compiler schedule code?
  - Not all stalls are predictable
    - e.g., cache misses
  - Can't always schedule around branches
    - Branch outcome is dynamically determined
  - Different implementations of an ISA have different latencies and hazards

29 Does Multiple Issue Work? (The BIG Picture)
  - Yes, but not as much as we'd like
  - Programs have real dependencies that limit ILP
  - Some dependencies are hard to eliminate
    - e.g., pointer aliasing
  - Some parallelism is hard to expose
    - Limited window size during instruction issue
  - Memory delays and limited bandwidth
    - Hard to keep pipelines full
  - Speculation can help if done well

30 The Opteron X4 Microarchitecture
[Figure: Opteron X4 microarchitecture block diagram]
  - 72 physical registers

31 The Opteron X4 Pipeline Flow
  - For integer operations
    - FP is 5 stages longer
    - Up to 106 RISC ops in progress
  - Bottlenecks
    - Complex instructions with long dependencies
    - Branch mispredictions
    - Memory access delays

32 Multithreading
  - Performing multiple threads of execution in parallel
    - Replicate registers, PC, etc.
    - Fast switching between threads
  - Fine-grain multithreading
    - Switch threads after each cycle
    - Interleave instruction execution
    - If one thread stalls, others are executed
  - Coarse-grain multithreading
    - Only switch on long stall (e.g., L2 cache miss)
    - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

33 Fine Grain Multithreading
  - Switch CPU threads with minimal (zero?) overhead
  - Multithreading now helps resolve fine-grain dependencies (e.g., forwarding?)

    Inst a   IF  ID  EX  MEM WB
    Inst M       IF  ID  EX  MEM WB
    Inst b           IF  ID  EX  MEM WB
    Inst N               IF  ID  EX  MEM
    Inst c                   IF  ID  EX
    Inst P                       IF  ID
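The interleaving in the diagram (a, M, b, N, c, P) is a round-robin over the two threads, one instruction per cycle. The sketch below reproduces that issue order; the function name is hypothetical, and stalls are deliberately not modelled.

```python
from itertools import zip_longest


def fine_grain_issue_order(*threads):
    """Round-robin issue order: one instruction from each thread in turn.

    A thread that has run out of instructions is simply skipped.
    """
    order = []
    for slot in zip_longest(*threads):
        order.extend(inst for inst in slot if inst is not None)
    return order


if __name__ == "__main__":
    thread_a = ["a", "b", "c"]
    thread_b = ["M", "N", "P"]
    print(fine_grain_issue_order(thread_a, thread_b))
```

Because consecutive instructions in the pipeline come from different threads, a dependent pair such as a and b is separated by an unrelated instruction, which is why fine-grain switching relaxes the need for forwarding.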

34 Fine Grain Multithreading
  - What about cache misses?
[Figure: pipeline diagram for Inst a, M, b, N, c, P in which Inst a misses in MEM and waits (Miss, Miss) before WB, while Inst b and Inst c of the same thread repeat ID and IF]
  - This has the advantage of simplicity

35 Coarse Grain Multithreading
  - Alternatively, if one CPU thread is stalled, issue every clock from the alternate thread

    Inst a   IF   ID   EX   MISS Miss Miss WB
    Inst M        IF   ID   EX   MEM  WB
    Inst b             IF   ID   (ID) (ID) EX
    Inst N                  IF   ID   EX   MEM
    Inst P                       IF   ID   EX
    Inst Q                            IF   ID

36 CPU Support for Fine Grain MT
[Figure: pipeline block diagram with per-thread state (PC A, PC B, GPRs A, GPRs B, VA Mapping A, VA Mapping B) and shared Address Translation, Inst Cache, Data Cache, and Fetch, Decode, Exec, Mem, and Write logic]

37 Simultaneous Multithreading
  - In multiple-issue dynamically scheduled processor
    - Schedule instructions from multiple threads
    - Instructions from independent threads execute when function units are available
    - Within threads, dependencies handled by scheduling and register renaming
  - Example: Intel Pentium 4 HT
    - Two threads: duplicated registers, shared function units and caches

38 Multithreading Example

39 Simultaneous MultiThreading
  - Let's look simply at instruction issue:
[Figure: pipeline diagram of instructions from two threads, listed in the order a, b, M, N, c, P, Q, d, e, tr, each passing through IF ID EX MEM WB]

40 Simultaneous Multi Threading
  - Permit different threads to occupy the same pipeline stage at the same time
  - This makes most sense with superscalar issue
[Figure: SMT pipeline block diagram with PC A, PC B, Inst Cache, Data Cache, Fetch Logic, Inst Issue Logic, Decode + Registers, Mem Logic, and Write Logic]

41 SMT issues
  - Asymmetric pipeline stall
    - One part of pipeline stalls; we want other pipeline to continue
  - Overtaking: want unstalled thread to make progress
  - Pipeline overcrowding: may need extra-wide pipeline registers (why?)
  - Existing implementations are (mainly) on out-of-order (OoO), register-renamed architectures

42 Future of Multithreading
  - Will it survive? In what form?
  - Power considerations => simplified microarchitectures
    - Simpler forms of multithreading
  - Tolerating cache-miss latency
    - Thread switch may be most effective
  - Multiple simple cores might share resources more effectively
[Chapter 7: Multicores, Multiprocessors, and Clusters]

43 Instruction and Data Streams
  - An alternate classification

                                   Data Streams
                                   Single                    Multiple
    Instruction Streams  Single    SISD: Intel Pentium 4     SIMD: SSE instructions of x86
                         Multiple  MISD: No examples today   MIMD: Intel Xeon e5345

  - SPMD: Single Program Multiple Data
    - A parallel program on a MIMD computer
    - Conditional code for different processors

44 SIMD
  - Operate elementwise on vectors of data
    - E.g., MMX and SSE instructions in x86
      - Multiple data elements in 128-bit wide registers
  - All processors execute the same instruction at the same time
    - Each with different data address, etc.
  - Simplifies synchronization
  - Reduced instruction control hardware
  - Works best for highly data-parallel applications

45 Vector Processors
  - Highly pipelined function units
  - Stream data from/to vector registers to units
    - Data collected from memory into registers
    - Results stored from registers to memory
  - Example: Vector extension to MIPS
    - element registers (64-bit elements)
    - Vector instructions
      - lv, sv: load/store vector
      - addv.d: add vectors of double
      - addvs.d: add scalar to each element of vector of double
  - Significantly reduces instruction-fetch bandwidth

46 Example: DAXPY (Y = a * X + Y)
  - Conventional MIPS code

          l.d   $f0,a($sp)       ;load scalar a
          addiu r4,$s0,#512      ;upper bound of what to load
    loop: l.d   $f2,0($s0)       ;load x(i)
          mul.d $f2,$f2,$f0      ;a * x(i)
          l.d   $f4,0($s1)       ;load y(i)
          add.d $f4,$f4,$f2      ;a * x(i) + y(i)
          s.d   $f4,0($s1)       ;store into y(i)
          addiu $s0,$s0,#8       ;increment index to x
          addiu $s1,$s1,#8       ;increment index to y
          subu  $t0,r4,$s0       ;compute bound
          bne   $t0,$zero,loop   ;check if done

  - Vector MIPS code

          l.d     $f0,a($sp)     ;load scalar a
          lv      $v1,0($s0)     ;load vector x
          mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
          lv      $v3,0($s1)     ;load vector y
          addv.d  $v4,$v2,$v3    ;add y to product
          sv      $v4,0($s1)     ;store the result
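For readers less familiar with MIPS, the two listings correspond roughly to the following sketch. The function names are hypothetical, and the chunk length of 64 is an assumption derived from the scalar loop's 512-byte bound with 8-byte doubles. Each line of the vector-style version stands for one vector instruction operating on a whole chunk.

```python
VECTOR_LENGTH = 64  # assumed: 512 bytes / 8 bytes per double


def daxpy_scalar(a, x, y):
    """Element-at-a-time loop, like the conventional MIPS code."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_vector_style(a, x, y):
    """One whole-chunk operation per step, like the vector MIPS code."""
    for base in range(0, len(x), VECTOR_LENGTH):
        end = base + VECTOR_LENGTH
        v1 = x[base:end]                       # lv      load vector x
        v2 = [a * e for e in v1]               # mulvs.d vector-scalar multiply
        v3 = y[base:end]                       # lv      load vector y
        v4 = [p + q for p, q in zip(v2, v3)]   # addv.d  add y to product
        y[base:end] = v4                       # sv      store the result
    return y


if __name__ == "__main__":
    x = [float(i) for i in range(64)]
    y = [1.0] * 64
    print(daxpy_scalar(2.0, x, list(y)) == daxpy_vector_style(2.0, x, list(y)))
```

The scalar version executes its loop-control instructions once per element, whereas the vector-style version needs only a handful of operations per 64 elements, which is the instruction-fetch saving the previous slide mentions.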

47 Vector vs. Scalar
  - Vector architectures and compilers
    - Simplify data-parallel programming
    - Explicit statement of absence of loop-carried dependences
      - Reduced checking in hardware
    - Regular access patterns benefit from interleaved and burst memory
    - Avoid control hazards by avoiding loops
  - More general than ad hoc media extensions (such as MMX, SSE)
    - Better match with compiler technology
