CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)

Ariel Fields
6 years ago
1 CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)

2 Memory Hierarchy We need a huge amount of cheap and fast memory Memory is either fast or cheap; never both. Do as politicians do: fake it Give a little bit of fast memory and tons of cheap memory. As technology progresses Cheap becomes cheaper rapidly Fast becomes faster rapidly Cheap does not become fast, or fast cheap

3 Since the 80's... Processors became 10,000 times faster Memory became 10 times faster Back then cache was for high performance systems Now we need multiple levels of cache

4 Addressing Scheme [Figure: mapping an address to the cache. Labels: Address, Block Address (Tag, Index), Block Offset, Set, Block, Match?]

5 Set-Associative (Set Address) = (Block Address) MOD K Where K is the number of sets in the cache (Block Address) = (Address) DIV b Where b is the number of bytes in the block (Block Offset) = (Address) MOD b Set has n blocks. (n-way associative) Every block has data and address (tag). If K=1, fully associative If n=1, direct mapped
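The three formulas on this slide can be sketched in a few lines of Python. The function name `split_address` is made up for illustration, and taking the tag as (Block Address) DIV K is an assumption (the usual convention), since the slide only says that every block stores an address tag.

```python
def split_address(address, block_bytes, num_sets):
    """Return (tag, set_index, block_offset) for a byte address."""
    block_address = address // block_bytes   # (Address) DIV b
    block_offset = address % block_bytes     # (Address) MOD b
    set_index = block_address % num_sets     # (Block Address) MOD K
    tag = block_address // num_sets          # assumed: the remaining high-order bits
    return tag, set_index, block_offset


# Example: 64-byte blocks and 512 sets.
tag, set_index, block_offset = split_address(0x12345, block_bytes=64, num_sets=512)
print(tag, set_index, block_offset)  # prints: 2 141 5
```

With `num_sets=1` every block lands in the same set (fully associative), matching the K=1 case above.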

6 Victim Selection Which block to expel to make room for new entry Least recently used Random First In First Out All work more or less the same. LRU is rarely exact, almost always approximate Little effect on big caches About 10% for smaller
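As a rough illustration of LRU victim selection within one set, here is a minimal sketch. The class `LRUSet` is hypothetical, and it implements exact LRU, whereas the slide notes that real hardware almost always approximates it.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with exact LRU replacement (illustrative only)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing was expelled."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # mark as most recently used
            return True, None
        victim = None
        if len(self.blocks) >= self.ways:
            victim, _ = self.blocks.popitem(last=False)  # expel the LRU block
        self.blocks[tag] = None
        return False, victim


lru_set = LRUSet(ways=2)
trace = [lru_set.access(t) for t in (1, 2, 1, 3)]
print(trace)  # the last access misses and expels tag 2
```

Swapping `popitem(last=False)` for a random choice, or never calling `move_to_end`, would give the Random and FIFO policies respectively.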

7 What Happens on a Write? Writes are less common than reads All instruction fetches are reads Stores are 10% of the instructions, loads, 25% We have 10 writes for every 125 reads Better take good care of the reads Writes are costlier than reads We write 1-8 bytes at a time in a block typically bytes long Have issues with consistency

8 Write Through, Back Write through No need to write back on a cache miss No need to have dirty bit Write back Less bus traffic

9 Write Through, Back Write through <-> no-write-allocate Allocate cache block only on reads Multiple writes w/o immediate read do not disturb cache Write back <-> write-allocate Makes subsequent reads fast

10 AMD Opteron Cache L1 cache (data): 64 KB, 64 byte blocks 1 K blocks 2-way associative 512 sets. LRU Write back, write allocate 64 bogo-bits address 48 virtual, 40 physical

11 AMD Opteron Cache Various sizes: Physical address: 40 bits Block address: 34 bits Block offset: 6 bits Cache index: 9 bits Tag: 25 bits Size of set: 2 blocks (2-way set associative) Number of clock cycles: 2 (2 stalls on hazard)
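The field widths listed here follow from the cache parameters on the previous slide. A small sketch that recomputes them, assuming all sizes are powers of two (the helper name `cache_fields` is made up):

```python
def cache_fields(cache_bytes, block_bytes, ways, phys_addr_bits):
    """Derive address field widths; assumes every size is a power of two."""
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    return {
        "blocks": num_blocks,
        "sets": num_sets,
        "block_address_bits": phys_addr_bits - offset_bits,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": phys_addr_bits - offset_bits - index_bits,
    }


# Opteron L1 data cache parameters from the slides: 64 KB, 64-byte blocks, 2-way, 40-bit physical address.
opteron = cache_fields(64 * 1024, 64, 2, 40)
print(opteron)
```

The result reproduces the slide: 1 K blocks, 512 sets, 34-bit block address, 6-bit offset, 9-bit index and 25-bit tag.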

12 AMD Opteron Cache Steps of cache hit: The 40 bits are split into tag(25), index(9), offset(6) A set (2 blocks) is retrieved using the index Their tags are compared and their valid bits checked The correct block is selected The 3 MSBits of the offset are used to select the word to be read/written. Update the LRU bits

13 AMD Opteron Cache Steps of cache miss Same up until we know it is a miss Identify a victim (LRU) If victim dirty write back If read, stall until next level responds, if write continue (provisionally)
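The hit and miss steps from the last two slides can be walked through with a toy model. This is only a sketch of a write-back, write-allocate, LRU cache with the Opteron's geometry as defaults; the class names are hypothetical, no data is stored, and timing (stalls, the two-cycle access) is ignored.

```python
class Block:
    def __init__(self, tag, dirty):
        self.tag = tag
        self.dirty = dirty


class WriteBackCache:
    """Toy set-associative, write-back, write-allocate cache with LRU."""

    def __init__(self, num_sets=512, block_bytes=64, ways=2):
        self.num_sets, self.block_bytes, self.ways = num_sets, block_bytes, ways
        self.sets = [[] for _ in range(num_sets)]  # each list: LRU first, MRU last
        self.writebacks = 0

    def access(self, address, is_write):
        block_address = address // self.block_bytes
        index = block_address % self.num_sets
        tag = block_address // self.num_sets
        blocks = self.sets[index]              # retrieve the set using the index
        for block in blocks:
            if block.tag == tag:               # tag match: hit
                blocks.remove(block)
                blocks.append(block)           # update the LRU order
                block.dirty = block.dirty or is_write
                return "hit"
        if len(blocks) == self.ways:           # miss: identify the LRU victim
            victim = blocks.pop(0)
            if victim.dirty:
                self.writebacks += 1           # dirty victim is written back
        blocks.append(Block(tag, dirty=is_write))
        return "miss"


cache = WriteBackCache()
stride = 512 * 64  # addresses this far apart map to the same set
results = [
    cache.access(0 * stride, is_write=True),
    cache.access(1 * stride, is_write=False),
    cache.access(0 * stride, is_write=False),
    cache.access(2 * stride, is_write=False),  # expels the clean block
    cache.access(3 * stride, is_write=False),  # expels the dirty block
]
print(results, cache.writebacks)
```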

14 Miss Rate Not your elementary school teacher The three Cs Compulsory (the first time) Capacity (the cache cannot hold all the blocks the program needs) Conflict (when the blocks have to share the same spot) We may add one more: coherency

15 Sneaky Miss Rate Miss Rate can be misleading Defined as misses per (1000) access(es) Our delay is related to misses per instruction Misses per instruction is the miss rate times the memory accesses per instruction. Even this can be misleading We want to reduce the delay

16 What is the delay Avg. Mem. Access Time = Hit Time + Miss Rate * Miss Penalty We do better by decreasing any of the three quantities in the right hand side Unfortunately, these always involve trade-offs And, they are just an approximation of the effect on the execution time.
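The formula is simple enough to write down as a one-line function. The numbers in the example call are made up purely to show the arithmetic.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Avg. Mem. Access Time = Hit Time + Miss Rate * Miss Penalty (in cycles)."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
amat = average_memory_access_time(hit_time=1, miss_rate=0.05, miss_penalty=100)
print(round(amat, 2))  # prints: 6.0
```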

17 Complications... What exactly is a miss in speculative execution? How much does a miss cost under dynamic scheduling? Under multi-threading? If we allow a miss over miss?

18 Example Effective Access time for 16KB+16KB split cache Miss per 1000 instructions: 3.82 (instr. Cache), 40.9 (data cache) Mix: 36% of instructions are load/store Hit: 1 cycle, Miss: 100 cycles

19 Example Instruction miss rate: (3.82/1000)/1.00 = 0.004 Data miss rate: (40.9/1000)/0.36 = 0.114 Percent of references that are instructions: 100/136 = 74% Avg. Mem. Acc. Time: 74%*(1 + 0.004*100) + 26%*(1 + 0.114*100)
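The split-cache example can be recomputed from the figures on the previous slide. This sketch assumes one instruction fetch per instruction and 0.36 data accesses per instruction, and it uses unrounded values, so the result can differ slightly from a hand calculation with rounded percentages.

```python
HIT_CYCLES = 1
MISS_CYCLES = 100
DATA_ACCESSES_PER_INSTR = 0.36

# Misses per 1000 instructions -> misses per access (the miss rate).
instr_miss_rate = (3.82 / 1000) / 1.0
data_miss_rate = (40.9 / 1000) / DATA_ACCESSES_PER_INSTR

# Fraction of all memory references that are instruction fetches.
instr_fraction = 1.0 / (1.0 + DATA_ACCESSES_PER_INSTR)
data_fraction = 1.0 - instr_fraction

amat_split = (instr_fraction * (HIT_CYCLES + instr_miss_rate * MISS_CYCLES)
              + data_fraction * (HIT_CYCLES + data_miss_rate * MISS_CYCLES))

print(f"instruction miss rate = {instr_miss_rate:.4f}")
print(f"data miss rate        = {data_miss_rate:.3f}")
print(f"average access time   = {amat_split:.2f} cycles")
```

The data cache dominates the result: it serves only about a quarter of the references but accounts for most of the miss cycles.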

20 Miss Penalty under Dynamic Execution Is it the full latency? Is it the exposed latency? What about the latency due to contention by speculative instructions Any form of latency has the same problem Simple (simplistic) solution Find which instruction did not commit in time Attribute the stall to it

21 How to Increase Performance Larger cache Obviously reduces misses Increases cost, power Increases hit time Larger block Decreases compulsory (initial) misses Better exploits spatial locality Decreases number of blocks Increases miss penalty, bus traffic

22 How to Increase Performance Higher associativity Reduce conflict misses Increase hit time, silicon area, power consumption Multilevel cache Reduces hit time and miss penalty Increase cost and power Give priority to read misses Let reads jump the queue Overlap TLB and cache read...

23 TLB and Cache Cache understands physical addresses We have to consult the TLB to convert a virtual address to physical How about if we overlap the two? When is such a thing possible?

24 What is the Trick? TLB is a small cache that associates a (virtual) page number to (physical) frame number The page offset and the frame offset are the same and need no translation If the page offset is enough to index the set in the cache We do not need any bit from the frame number We can retrieve the set while the TLB does the translation When the TLB is done we compare the tags

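25 This is the Trick [Figure: the cache mapping of slide 4 with a TLB translating the virtual address to a physical one. Labels: Virtual, TLB, Physical, Tag, Index, Block Offset, Set, Block Address, Block, Match?]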

26 Disadvantages Cache size = Page size * associativity Usually we want a medium size page and a large cache. There are ways to deviate from this rule with extra hardware.
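The size limit can be checked mechanically: the index and block-offset bits must all fit inside the page offset, which is the same as saying that one way of the cache is no larger than a page. A small sketch, where both helper names are made up and the 4 KB page size in the examples is an assumption, not a figure from the slides:

```python
def can_overlap_tlb(cache_bytes, ways, page_bytes):
    """True if the set can be indexed using page-offset bits only."""
    return cache_bytes // ways <= page_bytes


def max_overlapped_cache_bytes(page_bytes, ways):
    """Largest cache that still allows the overlap: page size * associativity."""
    return page_bytes * ways


PAGE_BYTES = 4096  # assumed page size for illustration

print(can_overlap_tlb(8 * 1024, ways=2, page_bytes=PAGE_BYTES))   # True
print(can_overlap_tlb(64 * 1024, ways=2, page_bytes=PAGE_BYTES))  # False
print(max_overlapped_cache_bytes(PAGE_BYTES, ways=2))             # 8192
```

Under that assumed page size, a 64 KB 2-way cache is one of the cases that would need the extra hardware mentioned above.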

27 11 Advanced Optimizations We organize them in 5 groups Reduce Hit Time Increase bandwidth Reduce Miss Penalty Reduce Miss Rate Prefetching

28 Small is Beautiful Small and simple caches are faster Reduce size Reduce associativity Rely on L2 cache L1 cache sizes do not change much with technology

29 Way Prediction Tag comparison costly Store, along with the data, prediction bits for the next access The index is augmented by the predictor bits The data is sent to the CPU while we check the tags If the tags do not match, we send an Oops! Pentium 4 uses it

30 Example Hit rate 85% (typical) Hit: 1 cycle Miss: 3 cycles Without: 2 cycles 0.85*1 + 0.15*3 = 1.3 < 2
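The same expected-value calculation as a short sketch. The function and parameter names are made up; "hit" here means a correct way prediction.

```python
def average_access_cycles(prediction_hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per access when the way prediction is sometimes wrong."""
    return (prediction_hit_rate * hit_cycles
            + (1.0 - prediction_hit_rate) * miss_cycles)


with_prediction = average_access_cycles(0.85, hit_cycles=1, miss_cycles=3)
without_prediction = 2  # cycles, from the slide
print(round(with_prediction, 2), with_prediction < without_prediction)  # prints: 1.3 True
```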

31 Trace Cache Seems so devious... It is almost Harry-Potterish The cache contains dynamic trace (sequence of instructions as they are executed) Branch prediction is folded into the cache Pentium 4 uses it for its micro-operations cache

32 Cache Pipeline Most caches have more than 1 cycle Pipelining is tried and true Embed the cache pipeline into the CPU pipeline Pentium 4 takes 4 cycles (despite way prediction, etc)

33 Non-Blocking Miss Allow hit under miss Or multiple miss under miss FP intensive programs benefit from multiple miss under miss Dynamic execution benefits from it

34 Multi-Banked Cache Multi-bank (aka interleaved) memories were always popular Suits best for L2 cache Allows each bank to be smaller Allows each to work independently Increase bandwidth AMD Opteron has 2 banks, Sun T1 has 4

35 Critical Word First Critical (the one we asked) first If the block is transmitted in multiple cycles Early restart Do not wait for the whole block to arrive
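One way to picture critical word first is as a transfer order for the words of a block. The sketch below assumes a wrap-around order starting at the requested word, which is a common choice but is not stated on the slide; the function name is hypothetical.

```python
def critical_word_first_order(words_per_block, critical_word):
    """Order in which words are sent: requested word first, then wrap around."""
    return [(critical_word + i) % words_per_block for i in range(words_per_block)]


# A block of 8 words where the processor asked for word 5.
order = critical_word_first_order(words_per_block=8, critical_word=5)
print(order)  # prints: [5, 6, 7, 0, 1, 2, 3, 4]
```

With early restart the processor can resume as soon as the first entry of this list arrives.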

36 Merging Write Buffer A write miss might be in the (victim) write-back buffer Similar idea to victim buffer (virtual memory)

37 Compiler Re-ordering Try to access arrays the way they are in the cache The magic behind fast matrix multiplication (blocking) Break the matrix into pieces that are comfortably fitting in the cache
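A minimal sketch of blocking for matrix multiplication follows. It is written in plain Python for readability, so it shows only the loop structure, not the speedup; the `tile` parameter is a made-up name for the piece size that would be chosen so the working tiles fit in the cache.

```python
def blocked_matmul(a, b, tile):
    """Multiply square matrices a and b (lists of lists) tile by tile."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):               # tile of rows of a and c
        for kk in range(0, n, tile):           # tile of columns of a / rows of b
            for jj in range(0, n, tile):       # tile of columns of b and c
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(blocked_matmul(a, b, tile=2))
```

The three outer loops pick one tile of each matrix; the three inner loops then reuse those pieces many times before moving on.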

38 Prefetching Hardware If two misses in the same page, prefetch Most prefetch instructions from the instruction cache Opteron and P-4 prefetch data too. Compiler Insert special prefetch instructions Needs non-blocking cache Increase traffic

39 Memory Technologies SRAM Static RAM Big transistors optimized for speed DRAM Cheap capacitors Optimized for density Reads destructively Needs refreshing

40 DRAMs Rule the Desktop Memory size and CPU speed grow at the same rate It always took about a second to scan the whole memory. Through most of their history DRAM capacity increased 4-fold every three years Now it increases 4-fold every four years. Speed increases about 5% per year.

41 DRAM Organization [Figure: DRAM block diagram. Labels: Address Buffer, Row Decoder, Column Decoder, Memory array, Sense Amps, Data-I/O]

42 RAS and CAS Row Address Strobe Column Address Strobe First goes RAS Whole row is copied out CAS selects the bit or bits
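The two strobes correspond to sending the address in two halves. A small sketch, assuming the row number sits in the high-order bits and the column number in the low-order bits (the slide does not say which is which); the function name and bit widths are made up.

```python
def split_dram_address(cell_address, column_bits):
    """Return (row, column): row is sent with RAS, column with CAS."""
    row = cell_address >> column_bits                  # assumed: high-order bits
    column = cell_address & ((1 << column_bits) - 1)   # assumed: low-order bits
    return row, column


row, column = split_dram_address(0b10110110, column_bits=4)
print(bin(row), bin(column))  # prints: 0b1011 0b110
```

Consecutive addresses share the same row under this split, which is why keeping RAS fixed and stepping the column (as in fast page mode) pays off.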

43 Improving RAM Fast Page Mode Increment CAS several times with the same RAS Make use of the modularity available Memory is organized in blocks 1-4 Mbits each for manufacturing reasons. Naturally interleaved

44 SDRAM, DDR Synchronous DRAM Shares the clock with the CPU No synchronization overhead in communication. DDR Double Data Rate Front end of memory is fast Heavily interleaved back-ends

45 Virtual Memory Expand RAM to disk (not that useful today) Allow multiple processes to share the physical memory Allow arbitrary mapping File I/O, shared memory, dynamic libraries, etc Critical to security

46 Security Virtual memory handled through the kernel Page tables can be manipulated only in monitor mode A process does not have the means to access the space of another process

47 However... A kernel is a huge program Huge programs have bugs Most bugs cause the system to crash A few of them are exploits.

48 A better way... Use virtual machines Much smaller Fewer bugs One extra level of protection VMs have other advantages as well Share a computer Cloud computing Can migrate a live program to different H/W

49 VMM Virtual Machine Monitors (hypervisors) Allow a guest OS to run efficiently as a process on a host OS User level code runs natively System calls are trapped and emulated VMM mediates between the guest OS and the H/W on the host Network connection, USB device management, etc Filesystem and state.

50 ISA Support An ISA supporting virtualization is called virtualizable Virtualization is a new idea (geologically speaking) Attempts by guest to execute privileged instructions result in traps The problem is that not all relevant instructions result in traps And handling virtual memory is tricky

51 Virtual Memory for Virtual Machines Normally we distinguish between Virtual memory Physical memory Now we have an intermediate level Real memory Guest OS maps virtual to real VMM maps real to physical

52 Shadow Page Table Two step process Too slow Interferes with h/w assisted virtual memory Shadow tables do it in one shot But this means guest OS cannot manage the page tables of its own processes TLBs must have PID tags and/or be flushed on context switch IBM ('70s) had one more level of indirection
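The "one shot" idea can be illustrated by composing the two mappings ahead of time. This is only a sketch with page tables modelled as dictionaries from page number to page number; the function names, the page numbers and the 4 KB page size are all made up for illustration.

```python
PAGE_BYTES = 4096  # assumed page size


def build_shadow_table(guest_table, vmm_table):
    """Compose virtual->real (guest OS) with real->physical (VMM)."""
    return {virtual: vmm_table[real]
            for virtual, real in guest_table.items()
            if real in vmm_table}


def translate(table, address):
    """Translate an address through a single page table."""
    page, offset = divmod(address, PAGE_BYTES)
    return table[page] * PAGE_BYTES + offset


guest_table = {0: 7, 1: 3}   # virtual page -> real page
vmm_table = {7: 42, 3: 9}    # real page -> physical page
shadow_table = build_shadow_table(guest_table, vmm_table)

address = 1 * PAGE_BYTES + 5
two_step = translate(vmm_table, translate(guest_table, address))
one_shot = translate(shadow_table, address)
print(shadow_table, two_step == one_shot)  # prints: {0: 42, 1: 9} True
```

The VMM has to rebuild or patch the composed table whenever either mapping changes, which is where the restriction on the guest OS managing its own page tables comes from.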

53 Virtualized I/O There are far too many devices and drivers to handle I/O happens with the mediation of H/W, so it would be too slow to handle with emulation Solution: generic devices for each type of I/O Network: time shared or NATed.

54 Example: Xen Instead of trying to emulate everything just to trick the guest Allow small modifications to the guest to keep things simple. It is called paravirtualization and Xen is the most popular example (VMware is another)

55 The Tricks of Xen Augment kernel E.g. 1% of Linux is modified Uses the protection levels of x86 Xen at level 0 (highest), guest OS at 1, apps at 3 Wraps I/O devices in special virtual machines (driver domains) and talks to them with page remapping

56 VMM and ISA Designers of ISAs were cheapos! To save a couple of bits had the same instruction behave differently in monitor mode and user mode POPF (pop flags) ignores privileged flags in user mode 70's technology IBM-370 is still the gold standard.

57 Cache and I/O Should we do I/O with the cache? Get the data immediately with perfect consistency Slows down processor, infects cache I/O directly with memory Most popular Works well with write through (no stale data) Or can mark pages as non-cacheable Or flush cache Or send cache invalidations

58 Fallacy: Predicting cache performance Miss rates vary by a factor of 10,000 or more Tremendous difference between instruction and data miss rates

59 RAMBUS promises RAMBUS had a bandwidth 8 times higher than competition Performance was only 0-15% faster overall Cost was 2-3 times higher (20% larger die) The reason was that most of the traffic is at the L2 cache.
