MEMORY HIERARCHY DESIGN


1 MEMORY HIERARCHY DESIGN Chapter 2

2 OUTLINE
Introduction
Basics of Memory Hierarchy
Advanced Optimizations of Cache
Memory Technology and Optimizations

3 READING ASSIGNMENT Important! Read Appendix B

4 INTRODUCTION Memory Challenges
Programmers want unlimited amounts of fast memory; however, fast memory is expensive.
Improvement of processor performance has been much faster than memory (on the order of 10-25K X and 30-80X versus roughly 6-8X for memory): the memory gap!

5 INTRODUCTION Efficient Solution? Memory hierarchy
Incrementally smaller and faster memories, each containing a subset of the memory below it, proceeding in steps up toward the processor
The entire addressable memory space is available in the largest, slowest memory
Why does it work? Spatial and temporal locality

6 INTRODUCTION "Ideally one would desire an indefinitely large memory capacity such that any particular word would be immediately available. We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible." A. W. Burks, H. H. Goldstine, and J. von Neumann

7 INTRODUCTION Memory hierarchy design becomes more crucial with recent multi-core processors (demand grows with the number of cores)
Intel Core i7 with four cores at 3.2 GHz:
Can generate two data references per core per clock cycle
25.6 billion 64-bit data references/second plus 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
DRAM bandwidth is only 6% of this (25 GB/s)
Several caching techniques:
Separate instruction and data caches at the first level
Multi-ported, pipelined caches
Two levels of cache per core
Shared third-level cache on chip
What about memory performance and power?

8 BASICS OF MEMORY HIERARCHIES Cache: the highest and fastest level in the hierarchy (SRAM); first introduced in research computers in the early 1960s
Upon a memory reference by the processor:
The cache is checked first
If the referenced item is found in the cache, a hit occurs and the lower level is not accessed
If the referenced item is not found, a miss occurs and the lower level is checked
Since retrieval from the lower level is costly, adjacent words are retrieved as well (a block)

9 BASICS OF MEMORY HIERARCHIES Where to place blocks in the cache?
Direct-mapped cache
One block per set; a block is placed at the same single location each time it is brought into the cache
#Blocks in cache = cache size / block size
Cache block # = (block address) modulo (# of blocks in the cache)
Fully-associative cache
A block can go anywhere in the cache (one set; replacement by random, LRU, ...)
Set-associative cache
The cache is divided into sets of blocks; n blocks in a set = n-way set associative
A block is placed anywhere within one set
Cache set # = (block address) mod (#sets)

10 BASICS OF MEMORY HIERARCHIES Where to place blocks in the cache? [figure]

11 BASICS OF MEMORY HIERARCHIES Example
Cache size = 32 KB, block size = 32 bytes, byte address = 8160
Block address = byte address / block size = 8160/32 = 255
1) Fully associative: any place
2) Direct mapped: cache block # = 255 mod (32 KB / 32 B) = 255 mod 1024 = 255
3) 4-way set associative: cache set # = 255 mod (32 KB / (4*32)) = 255 mod 256 = 255

12 BASICS OF MEMORY HIERARCHIES How to find a block in the cache? Store part of the address in the cache as well.
Tag: the high part of the memory address; checked to know whether the block contains the required information (identifies the block)
Index: selects the set, log2(#sets) bits
Offset: selects data within the block, log2(#bytes in block) bits
Valid bit: indicates whether the cache block holds valid information

13 BASICS OF MEMORY HIERARCHIES [Figure: 1K-word direct-mapped cache (1 word/block) - address split into tag, index, and byte offset; each entry holds a valid bit, tag, and data; a tag match on a valid entry signals a hit]

14 BASICS OF MEMORY HIERARCHIES [Figure: 4-way set-associative cache - 22-bit tag and 8-bit index; four (valid, tag, data) ways per set compared in parallel; a 4-to-1 multiplexor selects the hitting way's 32-bit data]

15 BASICS OF MEMORY HIERARCHIES The cache can't hold all the data found in the lower level! Which block to remove from the cache?
Direct mapped: no choice!
Set and fully associative:
Least-recently used (LRU): choose the block unused for the longest time; simple for 2-way, manageable for 4-way, too hard beyond that
Random: gives approximately the same performance as LRU at high associativity
FIFO: approximates LRU

16 BASICS OF MEMORY HIERARCHIES Caching data that is only read is easy! Writes are harder: a write is not allowed until the valid bit and tag are checked, and writes come in different sizes.
On a write hit:
A write-through cache updates the item in the cache and writes through to update main memory
A write-back cache updates only the copy in the cache; when the block is about to be replaced, it is copied back to memory (dirty bit!)
Both strategies require long memory accesses; use write buffers: Processor -> Cache -> write buffer -> DRAM

17 BASICS OF MEMORY HIERARCHIES What to do on a write miss?
Write allocate: fetch the block into the cache; has a miss penalty like a read miss!
No-write allocate: the block is modified only in the lower-level memory

18 BASICS OF MEMORY HIERARCHIES Caching on write? [figure]

19 BASICS OF MEMORY HIERARCHIES One measure of the benefits of different cache organizations is miss rate: the fraction of cache accesses that result in a miss.
Causes of misses (the three Cs):
Compulsory: the first access to a block that is not in the cache
Capacity: the cache cannot contain all the blocks
Conflict: multiple blocks map to the same location in the cache
Multithreading and multiprocessors add a fourth C: coherency

20 BASICS OF MEMORY HIERARCHIES A better measure is the average memory access time (AMAT):
AMAT = Hit time + Miss rate x Miss penalty
where hit time is the time to hit in the cache and miss penalty is the time to fetch the block from the lower level.
AMAT is still an indirect measure of performance; it is not a substitute for execution time, since speculative and multithreaded processors may execute other instructions during a miss.

21 BASICS OF MEMORY HIERARCHIES In terms of execution time, cache performance can be assessed using
CPU time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
where
Memory stall cycles = IC x (Memory accesses / Instruction) x Miss rate x Miss penalty
To accommodate different miss penalties and miss rates for reads and writes:
Memory stall cycles = IC x (Reads/Instruction x Read miss rate x Read miss penalty + Writes/Instruction x Write miss rate x Write miss penalty)

22 BASICS OF MEMORY HIERARCHIES [Worked example: memory accesses per instruction = 1 + 0.5]

23 VIRTUAL MEMORY Computers run multiple processes concurrently; it is too expensive to dedicate the entire memory space to a single process. Sharing!!
Virtual memory divides physical memory into blocks and allocates them to different processes
Allows sharing and protection, dynamic management, and code relocation

24 VIRTUAL MEMORY Processors deal with virtual addresses (VA)! Why?
Need to translate VA to PA: a page table loaded in memory is used. Expensive!
Use a cache for translation (the TLB). Note the cache is physically addressed. Why?

25 VIRTUAL MEMORY Where can a block/page be placed in main memory? Fully associative.
How is a block found if it is in main memory? Page table.
Which block should be replaced on a virtual memory miss? LRU, by the OS, using the use bit provided by the processor.
What happens on a write? Write-back!

26 BASICS OF MEMORY HIERARCHIES Cache optimization techniques: Split cache
In pipelined processors, the processor may request data and an instruction in the same cycle (a structural hazard). Use a split cache!
Improves the bandwidth of the cache
Each cache can be optimized separately (associativity, block size, capacity)
However, this fixes the cache size for each type!
Check the example on p. B-16

27 BASICS OF MEMORY HIERARCHIES Example (split cache)
Split cache: instruction cache miss rate 1%, data cache miss rate 2%
Unified cache: miss rate 2.5%
Miss penalty 200 cycles; 30% of instructions are memory (data) instructions
Solution
T_uni = IC x (1 + 1.3 x 0.025 x 200) x CC = 7.5 x IC x CC
T_split = IC x (1 + (1 x 0.01 + 0.3 x 0.02) x 200) x CC = 4.2 x IC x CC
Speedup = T_uni / T_split = 7.5/4.2 = 1.79

28 BASICS OF MEMORY HIERARCHIES Cache optimization techniques: Larger block size
Reduces miss rate (fewer compulsory misses) and reduces static power
However, increases miss penalty and capacity/conflict misses
Check the example on p. B-26

29 BASICS OF MEMORY HIERARCHIES Cache optimization techniques
Bigger caches to reduce miss rate. Cost? Increases hit time and static/dynamic power consumption.
Higher associativity to reduce miss rate: reduces conflict misses but increases hit time and power consumption.
Example on p. B-29

30 BASICS OF MEMORY HIERARCHIES Cache optimization techniques: Multilevel caches to reduce miss penalty
Reduce overall memory access time and are more power efficient
AMAT? Multilevel inclusion and exclusion!

31 BASICS OF MEMORY HIERARCHIES Cache optimization techniques
Giving priority to read misses over writes: reduces miss penalty, little impact on power
Avoiding address translation when indexing the cache: processors work with virtual addresses while caches use physical ones, so translation (via the page table) would otherwise be required before every cache access; use the TLB! Reduces hit time.

32 ADVANCED OPTIMIZATIONS OF CACHE Factors that are considered: hit time, miss rate, miss penalty, bandwidth, and power. Techniques can be classified as follows:

Category | Technique | Impact on power
Reducing the hit time | Small and simple first-level caches; Way prediction | Decrease
Increasing cache bandwidth | Pipelined caches; Multibanked caches; Nonblocking caches | Varying
Reducing the miss penalty | Critical word first; Merging write buffers | Little
Reducing the miss rate | Compiler optimizations | Improves
Reducing the miss penalty or miss rate via parallelism | Hardware prefetching; Compiler prefetching | Increase

33 ADVANCED OPTIMIZATIONS OF CACHE 1. Small and Simple First-Level Caches
The total amount of on-chip cache has increased dramatically; however, the amount of L1 cache has recently increased either slightly or not at all
Faster clocks demand faster, smaller caches, yet smaller caches increase capacity and conflict misses
Alternatively, designers have opted for more associativity rather than larger caches; however, power must then be considered!

34 BASICS OF MEMORY HIERARCHIES Cache size and hit time [figure]

35 BASICS OF MEMORY HIERARCHIES Cache size and energy [figure]

36 ADVANCED OPTIMIZATIONS OF CACHE 2. Way Prediction
Extra bits are used to predict the way (the block within the set) of the next cache access; the MUX is preset and only one tag comparison is performed
A mispredicted way results in checking the other blocks for matches in the next clock cycle
Accuracy: > 90% for two-way (more popular), > 80% for four-way; the I-cache has better accuracy than the D-cache
First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8
Extension: way selection! Saves power when correct, but increases the misprediction penalty
Example on p. 82

37 ADVANCED OPTIMIZATIONS OF CACHE 3. Pipelined Caches
Pipeline cache access so that a first-level cache access spans multiple cycles
Gives fast clock cycle time and high bandwidth, but slow hits
Examples: Pentium: 1 cycle; Pentium Pro through Pentium III: 2 cycles; Pentium 4 and Core i7: 4 cycles
Increases the penalty on mispredicted branches
Makes it easier to increase associativity!

38 ADVANCED OPTIMIZATIONS OF CACHE 4. Nonblocking Caches
For pipelined computers that allow out-of-order execution, the processor need not stall on a D-cache miss: it can continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data
A nonblocking cache allows the D-cache to continue supplying hits during a miss, increasing bandwidth
Two techniques:
Hit under miss: the processor keeps running until another miss occurs; simple implementation, reduces miss penalty
Hit under multiple misses: overlap multiple misses (Intel Core i7); beneficial only if the memory system is multibanked and can service multiple misses (parallel/pipelined); complex!
Performance evaluation: in general, processors can hide the L1 miss penalty but not the L2 miss penalty

39 ADVANCED OPTIMIZATIONS OF CACHE 4. Nonblocking Caches [Figure: cache access latency improvement]

40 ADVANCED OPTIMIZATIONS OF CACHE 4. Nonblocking Caches
Example. For a 32 KB data cache that implements one hit under miss, the cache latency is 85% of the direct-mapped cache. Given the miss rates below, would a 2-way set-associative cache have better latency than hit under one miss? Assume an L2 miss penalty of 10 cycles.
Miss rate: direct-mapped 5.2%, 2-way 4.9%
Solution
AMAT_DM = 1 + 0.052 x 10 = 1.52
AMAT_2-way = 1 + 0.049 x 10 = 1.49
AMAT_2-way / AMAT_DM = 1.49/1.52 = 98% > 85%
Hit under one miss is better!

41 ADVANCED OPTIMIZATIONS OF CACHE 5. Multibanked Caches
Divide the cache into independent banks to allow simultaneous accesses (originally used in DRAM)
ARM Cortex-A8 supports 1-4 banks for L2; Intel i7 uses 4 banks for L1 and 8 banks for L2
Spread block addresses sequentially across the banks (sequential interleaving)
Banking works best when the accesses naturally spread themselves across the banks
Multiple banks are also a way to reduce power!

42 ADVANCED OPTIMIZATIONS OF CACHE 6. Critical Word First and Early Restart
Generally, a processor needs only one word of a block at a time. Two strategies to reduce miss penalty:
Critical word first: request the missed word first from memory and send it to the processor as soon as it arrives; the processor continues execution while memory fills the rest of the block
Early restart: fetch the words of the block in normal order, but resume execution as soon as the requested word arrives
Beneficial with large blocks!
Issues: block size, effect of spatial locality, bandwidth of the memory system, calculation of miss penalty

43 ADVANCED OPTIMIZATIONS OF CACHE 7. Merging Write Buffer
Both write-through and write-back caches use a write buffer to reduce miss penalty
Write merging: when storing to a block that is already pending in the write buffer, update that write buffer entry
If the buffer is full and there is no address match, the cache (and processor) must wait until the buffer has an empty entry
Reduces stalls due to the write buffer being full
[Figure: buffer occupancy without vs. with write merging]

44 ADVANCED OPTIMIZATIONS OF CACHE 8. Compiler Optimizations
Apply optimization techniques at compile time to take advantage of the memory hierarchy; applied to improve both instruction misses and data misses
Two examples: loop interchange and blocking

45 ADVANCED OPTIMIZATIONS OF CACHE 8. Compiler Optimizations (Loop interchange)
Exchange the order of nested loops to access data sequentially, in the order it is stored. Assuming the arrays are too big to fit in the cache, this technique reduces misses by improving spatial locality.
Example. Let x be a two-dimensional array of size [5000,100] allocated so that x[i,j] and x[i,j+1] are adjacent (row major). Block = 4 words. Ignore conflict and compulsory misses. With the loops in the wrong order, memory accesses skip through memory in strides of 100 words. Misses?

46 ADVANCED OPTIMIZATIONS OF CACHE 8. Compiler Optimizations (Blocking)
Blocked algorithms operate on submatrices (blocks) to improve temporal locality!
Example. Multiplication of y and z to obtain x:
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
The two inner loops read all NxN elements of z[ ], read N elements of one row of y[ ] repeatedly, and write N elements of one row of x[ ].
Capacity misses are a function of N and cache size: if the cache can hold all 3 NxN matrices, there are no capacity misses.

47 ADVANCED OPTIMIZATIONS OF CACHE 8. Compiler Optimizations (Blocking)
Example, continued. Use a block size of BxB (B is called the blocking factor):
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B, N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B, N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

48 ADVANCED OPTIMIZATIONS OF CACHE 8. Compiler Optimizations Other techniques: loop fusion, array merging

49 ADVANCED OPTIMIZATIONS OF CACHE 9. Hardware Prefetching
Prefetch instructions and data to reduce miss penalty or miss rate, either directly into the cache or into an external buffer (stream buffer)
Instruction prefetch is frequently done in hardware, outside the cache: on a miss, the processor fetches two blocks; the requested block is placed in the instruction cache and the prefetched (next consecutive) block is placed in the instruction stream buffer
The Intel Core i7 supports hardware prefetching into both L1 and L2
Prefetching relies on memory bandwidth that would otherwise be unused. Power!

50 ADVANCED OPTIMIZATIONS OF CACHE 9. Hardware Prefetching [figure]

51 ADVANCED OPTIMIZATIONS OF CACHE 10. Compiler-Controlled Prefetching
Reduces miss rate or miss penalty! The compiler inserts prefetch instructions to request data before the processor needs it
Goal: overlap execution with the prefetching of data (useful in loops)
Useful when the processor can proceed during the fetch and the cache does not stall while data is being prefetched
Two flavors: register prefetch loads the value into a register; cache prefetch loads data only into the cache, not a register
Issuing prefetch instructions incurs an instruction overhead that might outweigh the benefits

52 ADVANCED OPTIMIZATIONS OF CACHE 10. Compiler-Controlled Prefetching
Example (p. 93). Consider the following code. Assume an 8 KB direct-mapped write-back (write-allocate) cache with 16-byte blocks. Arrays a and b are 3x100 and 101x3 double precision (elements 8 bytes long) and stored row major. Which accesses cause misses? Insert prefetch instructions to reduce misses. Calculate the number of prefetch instructions executed and the misses avoided by prefetching.

53 ADVANCED OPTIMIZATIONS OF CACHE 10. Compiler-Controlled Prefetching
Example (p. 93). Accesses that cause misses:
Array a is accessed in the same order it is stored. Since the block holds 2 elements, even j misses while odd j hits: 3x100/2 = 150 misses
Array b is accessed column-wise, but b benefits twice from temporal locality, so each element of b misses once: 101 misses
Total misses for the loop: 150 + 101 = 251

54 ADVANCED OPTIMIZATIONS OF CACHE 10. Compiler-Controlled Prefetching
Example (p. 93). Insert prefetch instructions to reduce misses; ignoring prefetching at the beginning and the end of the loop: 19 misses only!


56 MEMORY TECHNOLOGY AND OPTIMIZATIONS Performance measures of main memory emphasize latency and bandwidth.
Latency is the concern for caches: access time (time between a word request and its arrival) and cycle time (minimum time between unrelated requests to memory)
Bandwidth is the concern of multiprocessors and I/O; generally it is easier to improve bandwidth than latency, and multilevel caches and larger blocks make bandwidth important to caches too
Cache optimizations reduced the processor-memory gap but did not eliminate it, so innovations started happening inside the DRAM chips
DRAM for main memory, SRAM for caches

57 MEMORY TECHNOLOGY AND OPTIMIZATIONS SRAM Technology
6 transistors/bit; used in the three levels of caches
SRAMs don't need refresh, so the access time is very close to the cycle time
Faster than DRAM (3-5 times); low power to retain a bit in standby mode
Low density and more expensive!

58 MEMORY TECHNOLOGY AND OPTIMIZATIONS DRAM Technology
One transistor/bit; reading a bit destroys it, so it must be rewritten after being read
Must also be refreshed periodically (every ~8 ms) to prevent loss; bits in the same row can be refreshed simultaneously
Address lines are multiplexed to reduce pin count: the upper half of the address is the row access strobe (RAS), the lower half the column access strobe (CAS)
DRAMs are commonly sold on small boards called dual inline memory modules (DIMMs); DIMMs typically contain 4 to 16 DRAM chips, normally organized to be 8 bytes wide

59 MEMORY TECHNOLOGY AND OPTIMIZATIONS DRAM Technology
Amdahl's rule of thumb: memory capacity should grow linearly with processor speed. Unfortunately, memory capacity and speed have not kept pace with processors.
Performance improvement: ~5% in RAS (latency), ~12% in CAS (bandwidth)
Capacity: DRAM capacity growth has been slowing down. DRAMs obeyed Moore's law for 20 years (4x/3 years), then slowed to 2x/2 years, and more recently to 2x/4 years.

60 MEMORY TECHNOLOGY AND OPTIMIZATIONS DRAM Technology [figure]

61 MEMORY TECHNOLOGY AND OPTIMIZATIONS DRAM Technology [figure]

62 MEMORY TECHNOLOGY AND OPTIMIZATIONS DRAM Optimizations
Timing: repeated accesses to an open row buffer avoid another row access time
Synchronous DRAM (SDRAM): add a clock signal to the DRAM to avoid synchronization overhead; burst mode with critical word first
Wider interfaces: from 4-bit transfer mode up to 16-bit
Double data rate (DDR): transfer data on both the rising and falling edges of the DRAM clock signal; DDR, DDR2, DDR3, DDR4 (voltage and clock), and DDR5
Multiple banks on each DRAM device, accessed independently to improve bandwidth

63 MEMORY TECHNOLOGY AND OPTIMIZATIONS Graphics DRAM (GDRAM)
Based on SDRAM designs but tailored to the higher bandwidth demands of graphics processing units
GDDR5 is based on DDR3; earlier GDDRs were based on DDR2
GDDRs have wider interfaces: 32 bits versus 4, 8, or 16 in current designs
Higher maximum clock rate on the data pins: 2-5x the bandwidth per DRAM versus DDR3 DRAMs

64 MEMORY TECHNOLOGY AND OPTIMIZATIONS Reducing Power in SDRAM
Lower operating voltage
Banking
Power-down mode: disables the clock except for internal automatic refresh

65 MEMORY TECHNOLOGY AND OPTIMIZATIONS Flash Memory
A type of EEPROM (nonvolatile); must be erased in blocks before being overwritten (hence "flash")
Limited number of write cycles (~100,000)
Cheaper than SDRAM, more expensive than disk: ~$2/GB for flash, $20 to $40/GB for SDRAM, $0.09/GB for disk
Slower than SDRAM, faster than disk

66 MEMORY TECHNOLOGY AND OPTIMIZATIONS Enhancing Memory Dependability
Large caches and main memories significantly increase the possibility of errors. Two types:
Soft (dynamic) errors: changes in a cell's content
Hard errors: changes in the circuitry, during fabrication or operation
Soft errors can be detected and fixed by ECC; hard errors can be accommodated using spare rows that can be programmed in

Copyright 2012, Elsevier Inc. All rights reserved.


More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Cache performance Outline

Cache performance Outline Cache performance 1 Outline Metrics Performance characterization Cache optimization techniques 2 Page 1 Cache Performance metrics (1) Miss rate: Neglects cycle time implications Average memory access time

More information

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995 Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Fall 1995 Review: Who Cares About the Memory Hierarchy? Processor Only Thus Far in Course:

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design Advanced Computer Architecture- 06CS81-Memory Hierarchy Design AMAT and Processor Performance AMAT = Average Memory Access Time Miss-oriented Approach to Memory Access CPIExec includes ALU and Memory instructions

More information

DECstation 5000 Miss Rates. Cache Performance Measures. Example. Cache Performance Improvements. Types of Cache Misses. Cache Performance Equations

DECstation 5000 Miss Rates. Cache Performance Measures. Example. Cache Performance Improvements. Types of Cache Misses. Cache Performance Equations DECstation 5 Miss Rates Cache Performance Measures % 3 5 5 5 KB KB KB 8 KB 6 KB 3 KB KB 8 KB Cache size Direct-mapped cache with 3-byte blocks Percentage of instruction references is 75% Instr. Cache Data

More information

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache Memory Cache Memory Locality cpu cache memory Memory hierarchies take advantage of memory locality. Memory locality is the principle that future memory accesses are near past accesses. Memory hierarchies

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

CMSC 611: Advanced Computer Architecture. Cache and Memory

CMSC 611: Advanced Computer Architecture. Cache and Memory CMSC 611: Advanced Computer Architecture Cache and Memory Classification of Cache Misses Compulsory The first access to a block is never in the cache. Also called cold start misses or first reference misses.

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers

More information

Lecture 11. Virtual Memory Review: Memory Hierarchy

Lecture 11. Virtual Memory Review: Memory Hierarchy Lecture 11 Virtual Memory Review: Memory Hierarchy 1 Administration Homework 4 -Due 12/21 HW 4 Use your favorite language to write a cache simulator. Input: address trace, cache size, block size, associativity

More information

Types of Cache Misses: The Three C s

Types of Cache Misses: The Three C s Types of Cache Misses: The Three C s 1 Compulsory: On the first access to a block; the block must be brought into the cache; also called cold start misses, or first reference misses. 2 Capacity: Occur

More information

MEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming

MEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming MEMORY HIERARCHY DESIGN B649 Parallel Architectures and Programming Basic Optimizations Average memory access time = Hit time + Miss rate Miss penalty Larger block size to reduce miss rate Larger caches

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Who Cares About the Memory Hierarchy? Processor Only Thus

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache performance 4 Cache

More information

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B. Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.5) Memory Technologies Dynamic Random Access Memory (DRAM) Optimized

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Advanced cache optimizations. ECE 154B Dmitri Strukov

Advanced cache optimizations. ECE 154B Dmitri Strukov Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Performance

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff Lecture 20: ory Hierarchy Main ory and Enhancing its Performance Professor Alvin R. Lebeck Computer Science 220 Fall 1999 HW #4 Due November 12 Projects Finish reading Chapter 5 Grinch-Like Stuff CPS 220

More information

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY

More information

Memory Hierarchy. Advanced Optimizations. Slides contents from:

Memory Hierarchy. Advanced Optimizations. Slides contents from: Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.

More information

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

k -bit address bus n-bit data bus Control lines ( R W, MFC, etc.)

k -bit address bus n-bit data bus Control lines ( R W, MFC, etc.) THE MEMORY SYSTEM SOME BASIC CONCEPTS Maximum size of the Main Memory byte-addressable CPU-Main Memory Connection, Processor MAR MDR k -bit address bus n-bit data bus Memory Up to 2 k addressable locations

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 5 Memory Hierarchy Design CA Lecture10 - memory hierarchy design (cwliu@twins.ee.nctu.edu.tw) 10-1 Outline 11 Advanced Cache Optimizations Memory Technology and DRAM

More information

Topics. Digital Systems Architecture EECE EECE Need More Cache?

Topics. Digital Systems Architecture EECE EECE Need More Cache? Digital Systems Architecture EECE 33-0 EECE 9-0 Need More Cache? Dr. William H. Robinson March, 00 http://eecs.vanderbilt.edu/courses/eece33/ Topics Cache: a safe place for hiding or storing things. Webster

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance:

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance: #1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

EE414 Embedded Systems Ch 5. Memory Part 2/2

EE414 Embedded Systems Ch 5. Memory Part 2/2 EE414 Embedded Systems Ch 5. Memory Part 2/2 Byung Kook Kim School of Electrical Engineering Korea Advanced Institute of Science and Technology Overview 6.1 introduction 6.2 Memory Write Ability and Storage

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information