Memory Hierarchy Design


1 Advanced Computer Architecture Memory Hierarchy Design

2 Some slides are from the instructors' resources that accompany the 6th and previous editions of the textbook. Some slides are from David Patterson, David Culler, and Krste Asanovic of UC Berkeley; Israel Koren of UMass Amherst; and Milos Prvulovic of Georgia Tech. Otherwise, the source of the slide is mentioned at the bottom of the page. Please send an email if a name is missing in the above list.

3 Introduction 3

4 Programmers want unlimited amounts of memory with low latency. Fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy. The entire addressable memory space is available in the largest, slowest memory. Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor. Temporal and spatial locality ensures that nearly all references can be found in the smaller memories. This gives the illusion of a large, fast memory being presented to the processor. 4

5 Memory hierarchy: personal mobile device (PMD) 5

6 Memory hierarchy: laptop or desktop 6

7 Memory hierarchy: server 7

8 Processor-Memory Performance Gap Figure: performance versus time (year), following Moore's Law; µproc improves ~60%/year while DRAM improves ~7%/year, so the gap widens. 8

9 Memory Hierarchy Design Memory hierarchy design becomes more crucial with recent multi-core processors. Aggregate peak bandwidth grows with # cores: the Intel Core i7 can generate two references per core per clock. With four cores and a 4.2 GHz clock: 32.8 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s! DRAM bandwidth is only 8% of this (34.1 GB/s). Requires: multi-port, pipelined caches; two levels of cache per core; a shared third-level cache on chip. 9

10 Memory hierarchy in embedded computers The memory hierarchy of embedded computers differs from that of desktops: 1- They are used in real-time applications; caches improve average-case performance but can degrade worst-case performance. 2- They are concerned about power and battery life, so hardware-intensive memory hierarchy performance techniques are less interesting. 3- Embedded systems typically run only one application, so the protection role of the memory hierarchy is often diminished. 4- The main memory itself may be quite small, with no disk storage. 10

11 Memory Technology 11

12 Core Memory Core memory was the first large-scale reliable main memory, invented by Forrester in the late 40s/early 50s at MIT for the Whirlwind project. Bits were stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires. Coincident current pulses on the X and Y wires would write a cell and also sense its original state (destructive reads). Robust, non-volatile storage. Cores were threaded onto wires by hand (25 billion a year at peak production). Core access time ~ 1 µs. 12

13 Naive Register File 13

14 Memory Arrays: Register File 14

15 Memory Arrays: SRAM 15

16 Memory Arrays: DRAM 16

17 Semiconductor Memory, DRAM Semiconductor memory began to be competitive in the early 1970s; Intel was formed to exploit the market for semiconductor memory. The first commercial DRAM was the Intel 1103: 1 Kbit of storage on a single chip, with charge on a capacitor used to hold each value. Semiconductor memory quickly replaced core in the 70s. 17

18 One Transistor Dynamic RAM Figure: 1-T DRAM cell, with word line (poly, W), bit line, access transistor, and storage capacitor (FET gate, trench, or stack; e.g., TiN top electrode at V_REF, Ta2O5 dielectric, W bottom electrode). 18

19 DRAM Architecture Figure: a 2^N x 2^M array of one-bit memory cells at the intersections of word lines and bit lines, with an N-bit row address decoder and an M-bit column decoder plus sense amplifiers delivering data D. Bits are stored in 2-dimensional arrays on chip. Modern chips have around 4 logical banks on each chip, each logical bank physically implemented as many smaller arrays. 19

20 DRAM Packaging Figure: DRAM chip with clock and control signals (~7), multiplexed row/column address lines (~12), and a data bus (4b, 8b, 16b, 32b). A DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes buffers are needed to drive the signals to all chips). The data pins work together to return a wide word (e.g., a 64-bit data bus using 16 x4-bit parts). 20

21 DRAM Operation Three steps in a read/write access to a given bank:
Row access (RAS): decode the row address and enable the addressed row (often multiple Kb in a row); the bitlines share charge with the storage cells; the small change in voltage is detected by sense amplifiers, which latch the whole row of bits; the sense amplifiers then drive the bitlines full rail to recharge the storage cells.
Column access (CAS): decode the column address to select a small number of sense amplifier latches (4, 8, 16, or 32 bits depending on the DRAM package); on a read, send the latched bits out to the chip pins; on a write, change the sense amplifier latches, which then charge the storage cells to the required value. Multiple column accesses can be performed on the same row without another row access (burst mode).
Precharge: charges the bit lines to a known value; required before the next row access.
Each step has a latency of around 15-20 ns in modern DRAMs. Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share the same core architecture. 21

22 Capacity and access times for DDR SDRAMs by year of production. Access time is for a random memory word and assumes a new row must be opened. If the row is in a different bank, we assume the bank is precharged; if the row is not open, then a precharge is required, and the access time is longer. 22

23 Improving bandwidth 1. Timing signals that allow repeated accesses to the row buffer without another row access time, typically called fast page mode. Such a buffer comes naturally, as each array buffers 1024 to 4096 bits for each access. 2. Add a clock signal to the DRAM interface, so that the repeated transfers do not bear that synchronization overhead: synchronous DRAM (SDRAM). SDRAMs typically also have a programmable register to hold the number of bytes requested, and hence can send many bytes over several cycles per request. 3. Transfer data on both the rising edge and falling edge of the DRAM clock signal, thereby doubling the peak data rate. This optimization is called double data rate (DDR). 23

24 Labeling DDR DIMM A DDR DIMM is connected to a 133 MHz bus. Why is it called PC2100? Its transfer rate is: 133 MHz x 2 (DDR) x 8 bytes ≈ 2100 MB/sec. 24

25 Labeling DDR DIMM Example Suppose a new DDR3 DIMM is transferring data at 16000 MB/sec. What should it be named? Answer The DIMM name should be PC16000. The clock rate of the DIMM: Clock rate x 2 x 8 = 16000, so Clock rate = 16000/16 = 1000 MHz. (Photo: a CL5 1066 MHz DDR2 DIMM.) 25
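To make the naming rule concrete, here is a small C sketch (not from the original slides; the function name and the fixed 8-byte bus width are illustrative assumptions) that computes the PC rating from the bus clock:

#include <stdio.h>

/* PC rating = clock (MHz) x 2 transfers/clock (DDR) x bus width in bytes.
   Assumes the common 64-bit (8-byte) DIMM data bus. */
static unsigned pc_rating(unsigned clock_mhz) {
    return clock_mhz * 2 * 8;   /* MB/sec */
}

int main(void) {
    printf("133 MHz DDR   -> PC%u\n", pc_rating(133));  /* 2128; marketed (rounded) as PC2100 */
    printf("1000 MHz DDR3 -> PC%u\n", pc_rating(1000)); /* PC16000 */
    return 0;
}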

26 Memory Dependability Memory is susceptible to cosmic rays. Soft errors: dynamic errors, detected and fixed by error correcting codes (ECC). Hard errors: permanent errors; use spare rows to replace defective rows. Chipkill: a RAID-like error recovery technique. 26

27 HBM: Stacked or Embedded DRAMs Placing multiple DRAMs in a stacked or adjacent fashion, embedded within the same package as the processor, possibly with the DRAM die directly on the CPU die. Also called high bandwidth memory (HBM). The 2.5D form is available now; 3D stacking is under development and faces heat-management challenges due to the CPU. 27

28 Chipkill Chipkill was introduced by IBM to solve the problem of memory chip failure. Similar in nature to the RAID approach used for disks, Chipkill distributes the data and ECC information so that the complete failure of a single memory chip can be handled by reconstructing the missing data from the remaining memory chips. An analysis by IBM, assuming a 10,000-processor server with 4 GB per processor, yields the following rates of unrecoverable errors in 3 years of operation: 1. Parity only: about 90,000, or one unrecoverable (or undetected) failure every 17 minutes. 2. ECC only: about 3500, or about one undetected or unrecoverable failure every 7.5 hours. 3. Chipkill: 6, or about one undetected or unrecoverable failure every 2 months. 28

29 Review of the ABCs of Caches 29

30 1) cache hit 2) cache miss 3) block 4) virtual memory 5) page 6) page fault 7) memory stall cycles 8) miss penalty 9) miss rate 10) address trace 11) direct mapped 12) fully associative 13) n-way set associative 14) valid bit 15) dirty bit 16) least-recently used 17) random replacement 18) block address 19) block offset 20) tag field 21) index 22) write through 23) write back 24) write allocate 25) no-write allocate 26) instruction cache 27) data cache 28) unified cache 29) write buffer 30) average memory access time 31) hit time 32) misses per instruction 33) write stall 30

31 Miss-oriented approach to memory access 31
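For reference, the standard miss-oriented formulation of CPU time (as given in Hennessy & Patterson) is:

CPU time = IC x (CPI_execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

Here the memory-stall component is folded into CPI as misses per instruction times the miss penalty.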

32 Example Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits? (Figure: ideal cache vs. real cache.) 32
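Working the arithmetic from the values given (this computation is implied by the slide's ideal-vs-real comparison): memory accesses per instruction = 1 instruction fetch + 0.5 data accesses = 1.5. CPI(real) = 1.0 + 1.5 x 2% x 25 = 1.75, versus CPI(ideal) = 1.0, so the all-hit computer would be 1.75 times faster.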

33 Four Memory Hierarchy Questions For the first level of the memory hierarchy: Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy) 33

34 Locality of Reference Consider the string of references in a typical program 1. Temporal Locality 2. Spatial Locality 3. Sequential Locality 34

35 Common And Predictable Memory Reference Patterns 35

36 Memory Reference Patterns Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971). 36

37 Prefetch Policy 1. Prefetch on a miss: a miss on block i triggers a fetch of block i+1. 2. Tagged prefetch: in addition, the first reference to a previously prefetched block triggers a fetch of the next block. (Figure: BLOCK i, BLOCK i+1, BLOCK i+2.) 37

38 Example: cache has 8 block frames and memory has 32 blocks. 38

39 Direct-Mapped Cache Figure: the address is split into tag (t bits), index (k bits), and block offset (b bits); the index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared (=) with the address tag to produce Hit, and the offset selects the data word or byte. (Slide: Krste Asanovic, University of California at Berkeley) 39

40 2-Way Set-Associative Cache Figure: the index selects one set; the two ways each hold a valid bit, tag, and data block; two comparators check the address tag against both stored tags in parallel, a multiplexer selects the matching way's data word or byte, and the OR of the comparisons produces Hit. (Slide: Krste Asanovic, University of California at Berkeley) 40

41 Fully Associative Cache Figure: there is no index; every line's tag is compared (=) against the address tag in parallel, any match asserts HIT and selects that line's data block, and the block offset (b bits) picks the data word or byte. (Slide: Krste Asanovic, University of California at Berkeley) 41

42 Disadvantage of Set Associative Cache N-way set associative cache versus direct mapped cache: N comparators vs. 1; extra MUX delay for the data; data comes AFTER the Hit/Miss decision and set selection. In a direct mapped cache, the cache block is available BEFORE Hit/Miss: it is possible to assume a hit and continue, and recover later if it was a miss. 42

43 Next and Last references d_t(x): forward distance of block x at time t, i.e., the number of references from t until the next reference to x (d_t(x) = k if the next reference to x is the k-th reference after t; infinite if x is never referenced again). b_t(x): backward distance of block x at time t, i.e., the number of references since the last reference to x. 43

44 Replacement Policies Z(t): existing set of blocks in the cache. Q(Z(t)): the block that is to be replaced. 1. LRU: Q(Z(t)) = y iff b_t(y) = max{ b_t(x) : x in Z(t) }, i.e., replace the block with the largest backward distance. 2. MIN: Q(Z(t)) = y iff d_t(y) = max{ d_t(x) : x in Z(t) }, i.e., replace the block with the largest forward distance (optimal, but requires future knowledge). 3. LFU 4. FIFO 5. CLOCK 6. LIFO 7. RAND 44
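As a concrete illustration of the LRU rule above, a minimal C sketch (mine, not the slides'; the set contents and timestamps are made up for the example) that picks the victim with the largest backward distance by tracking last-reference times:

#include <stdio.h>

#define WAYS 4

/* One cache set: block ids and the "time" each was last referenced. */
static int block[WAYS]     = {10, 20, 30, 40};
static int last_used[WAYS] = { 5,  9,  2,  7};

/* LRU victim = block with the largest backward distance,
   i.e., the smallest last-reference time. */
static int lru_victim(int now) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    printf("victim: block %d (backward distance %d)\n",
           block[victim], now - last_used[victim]);
    return victim;
}

int main(void) {
    lru_victim(12);  /* replaces block 30, untouched since time 2 */
    return 0;
}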

45 Clock Replacement Policy Initial values in the LRU table. Block 1 is referenced. The sequence of references is: 1,2,3,4,5,6,7,8,1; after that a replacement is to be made. 45

46 Clock Replacement Policy Block 2 is referenced Block 3 is referenced

47 Clock Replacement Policy Block 4 is referenced Block 5 is referenced

48 Clock Replacement Policy Block 6 is referenced Block 7 is referenced

49 Clock Replacement Policy Block 8 is referenced. Block 1 is referenced. The block with all 0s in its row is to be replaced: block 2. Block 1 was initially the candidate due to the FIFO strategy, but we gave Block 1 a second chance since it has been recently used.

50 Effect of different replacement policies Data cache misses per 1000 instructions 50

51 Cache Update Policies Write through (WT): WTWA (write-through, write-allocate) and WTNWA (write-through, no-write-allocate). tc = cache cycle time, tm = memory cycle time, tb = block transfer time, wt = write ratio. The average time to complete a reference under 1- WTWA and 2- WTNWA is expressed in terms of these parameters (see the sketch after the next slide). 51

52 Cache Update Policies Write Back (WB): simple write back (SWB) and flagged write back (FWB, using a dirty bit). The average reference time for SWB and FWB is likewise expressed in terms of the parameters above. 52
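One common textbook-style formulation of these average reference times, stated here as an assumption rather than as the slides' own equations (h = hit ratio, wd = fraction of replaced blocks that are dirty, other parameters as defined above):

WTWA:  T = tc + (1 - h) x tb + wt x (tm - tc)
WTNWA: T = tc + (1 - h) x (1 - wt) x tb + wt x (tm - tc)
SWB:   T = tc + (1 - h) x 2 x tb        (every miss writes back the victim block and fetches the new one)
FWB:   T = tc + (1 - h) x (1 + wd) x tb (only dirty victims are written back)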

53 Intel Core i7-965 XE & Core i

54 CINEBENCH 9.5 CPU TEST 7

55 CineBench 11.5 Score (Higher is Better) CINEBENCH is a performance suite that utilizes CINEMA 4D for both CPU- and video-based testing. CINEMA 4D is a 3D modeling, animation, motion graphics, and rendering application developed by MAXON Computer GmbH in Germany.

56 Misses per 1000 instructions:
Size     Instruction cache   Data cache   Unified cache
8 KB     8.16                44.0         63.0
16 KB    3.82                40.9         51.0
32 KB    1.36                38.4         43.3
64 KB    0.61                36.9         39.4
128 KB   0.30                35.3         36.2
256 KB   0.02                32.6         32.9

57 Example: Harvard Architecture Proc with a unified 32 KB L1 cache, versus Proc with a 16 KB L1 I-cache and a 16 KB L1 D-cache. Which is better? Assume 36% of instructions are data transfers; 74% of accesses are instruction references; hit time = 1, miss penalty = 100. Note that a data hit has 1 extra stall for the unified cache (only one port). Miss rate(16 KB instruction) = 3.82/1000/1.0 = 0.0038 per access. Miss rate(16 KB data) = 40.9/1000/0.36 = 0.1136 per access. Miss rate(32 KB unified) = 43.3/1000/1.36 = 0.0318 per access. 57

58 Example: Harvard Architecture Average Memory Access Time = AMAT. AMAT(split) = 74% x (1 + 0.0038 x 100) + 26% x (1 + 0.1136 x 100) = 4.24. AMAT(unified) = 74% x (1 + 0.0318 x 100) + 26% x (1 + 1 + 0.0318 x 100) = 4.44. The split cache is better. 58

59 Improving Cache Performance Average memory access time = Hit time + Miss rate x Miss penalty. How do you define miss penalty? Is it the full latency of the miss to memory, or is it just the exposed or non-overlapped latency when the processor must stall? Memory stall cycles / Instruction = (Misses / Instruction) x (Total miss latency - Overlapped miss latency). 59

60 Optimization of Cache Performance AMAT = hit time + (miss rate x miss penalty). 1. Reducing the hit time: small and simple first-level caches and way-prediction. Both techniques also generally decrease power consumption. 2. Increasing cache bandwidth: pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption. 3. Reducing the miss penalty: critical word first and merging write buffers. These optimizations have little impact on power. 4. Reducing the miss rate: compiler optimizations. Obviously any improvement at compile time improves power consumption. 5. Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily because of prefetched data that are unused. 60

61 1- Reducing Cache hit time 61

62 Small and Simple First-Level Caches to Reduce Hit Time and Power Figure: access time (ns) versus cache size and associativity. 62

63 Example Determine whether a 32 KB 4-way set associative L1 cache has a faster memory access time than a 32 KB 2-way set associative L1 cache. Assume the miss penalty to L2 is 15 times the access time for the faster L1 cache. Ignore misses beyond L2. Which has the faster average memory access time? Answer: For the 2-way cache, AMAT = Hit time + Miss rate x Miss penalty, with hit time = 1 and Miss penalty = access time to L2 = 15 x 1 = 15, so AMAT(2-way) = 1 + Miss rate(2-way) x 15. For the 4-way cache, the hit time is 1.4 times longer, so AMAT(4-way) = 1.4 + Miss rate(4-way) x 15. Since the 4-way miss rate is only slightly lower while its hit time is 40% longer, the 2-way cache has the faster average memory access time. 63

64 Energy consumption per read 64

65 Way Prediction to Reduce Hit Time Idea: the conflict miss rate decreases with higher associativity, but hit time goes up due to the more complex circuits needed to select and mux the right set member. Hence organize the cache as n-way set associative, but use a predictor to say which way to look in rather than searching the whole set; hit time is then the same as in a simple direct-mapped cache. Is it reasonable? The Alpha 21264 uses it, and MIPS used it post-R10K; both used a way-predicted 2-way set-associative model. Two benefits: power, and speed if the prediction is correct. Fast hits and slow hits become a problem when the prediction is incorrect. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way set associative cache and 80% for a four-way set associative cache, with better accuracy on I-caches than D-caches. 65

66 Way prediction: reduces the number of comparators; addressing is similar to a direct-mapped cache; reduces power and reduces hit time. Way selection: uses the way prediction and accesses the data based on the prediction alone. Overall it reduces power further, but when a mis-prediction occurs it has to access the data again and discard the incorrectly accessed data; hence, the average hit time increases. 66

67 Way selection: for a 4-way cache it increases the average access time by a factor of 1.04 for the I-cache and 1.13 for the D-cache. Average cache power consumption relative to a normal 4-way cache: 0.28 for the I-cache and 0.35 for the D-cache, i.e., Power(I-cache, way-selection) = 0.28 x Power(I-cache, normal). One significant drawback of way selection is that it makes it difficult to pipeline the cache access; however, as energy concerns have mounted, schemes that do not require powering up the entire cache make increasing sense. 67

68 Example Assume that there are half as many D-cache accesses as I-cache accesses, and that the I-cache and D-cache are responsible for 25% and 15% of the processor's power consumption in a normal four-way set associative implementation. Determine if way selection improves performance per watt based on the estimates from the preceding study. Answer For the I-cache, the savings in power is 0.25 x 0.28 = 0.07 of the total power, while for the D-cache it is 0.15 x 0.35 = 0.05, for a total savings of 0.12. The increase in cache access time is the increase in I-cache average access time plus one-half the increase in D-cache access time: Normal access time = 1 + 0.5 = 1.5. Way-selection access time = 1.04 + 0.5 x 1.13 = 1.60. This optimization is best used where power rather than performance is the key objective. 68

69 Reducing Hit Time: Avoiding Address Translation 69

70 70

71 Reducing Hit Time: Trace cache Instead of limiting the instructions in a static cache block to spatial locality, a trace cache finds a dynamic sequence of instructions including taken branches to load into a cache block. 71

72 2- Increasing Cache Bandwidth 72

73 Pipelined Access and Multi-banked Caches to Increase Bandwidth This technique simply pipelines cache access, so that the effective latency of an L1 cache hit can be multiple clock cycles, giving fast cycle time and slow hits. For example, the L1 hit pipeline for the Pentium 4 takes four clocks. This split increases the number of pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of the data: Pentium: 1 cycle; Pentium Pro through Pentium III: 2 cycles; Pentium 4 through Core i7: 4 cycles. 73

74 Pipelined Access and Multi-banked Caches to Increase Bandwidth Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block. 74

75 3- Reducing Cache Miss Penalty 75

76 1st Miss Penalty Reduction Technique: Multi-Level Caches Question: a faster cache to keep pace with the speed of CPUs, or a larger cache to overcome the widening gap between the CPU and main memory? Local miss rate: the number of misses divided by the total number of memory accesses to this cache. For the L1 cache it is Miss rate(L1), and for the L2 cache it is Miss rate(L2). Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the CPU. For the L1 cache it is still just Miss rate(L1); for the L2 cache it is Miss rate(L1) x Miss rate(L2). 76
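For reference, these rates combine into the standard two-level average memory access time (as in Hennessy & Patterson):

AMAT = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Local miss rate(L2) x Miss penalty(L2))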

77 1 st Miss Penalty Reduction Technique: Multi-Level Caches EXAMPLE Suppose that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates? The miss rate (either local or global) for the L1 cache is 40/1000 or 4%. The local miss rate for the L2 cache is 20/40 or 50%. The global miss rate of the second-level cache is 20/1000 or 2%. 77

78 Miss rates Figure: local and global miss rates of the 2nd-level cache versus L2 cache size (KB); L1 consists of two 64 KB caches. A second curve shows the global miss rate of L2 where L1 is 32 KB. 78

79 Relative execution time by second-level cache size. The two bars are for different clock-cycle counts for an L2 cache hit. The reference execution time of 1.00 is for an 8192-KB second-level cache with a one-clock-cycle latency on a second-level hit. 79

80 Multilevel inclusion: L1 data is always present in L2. Inclusion is desirable because consistency between I/O and caches (or among caches in a multiprocessor) can be determined just by checking the second-level cache. If the L2 cache is only slightly bigger than the L1 cache, multilevel exclusion may be used instead: L1 data is never found in the L2 cache. Typically, with exclusion a cache miss in L1 results in a swap of blocks between L1 and L2 instead of the replacement of an L1 block with an L2 block. 80

81 2nd Miss Penalty Reduction Technique: Critical Word First and Early Restart Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first. Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. 81

82 3 rd Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes 82

83 4th Miss Penalty Reduction Technique: Merging Write Buffer Figure: a four-entry write buffer before and after merging. Without merging, four sequential writes (Mem[100], Mem[108], Mem[116], Mem[124]) occupy four entries with one valid word each; with merging, all four words are placed in a single entry with all of its valid bits set. 83

84 5th Miss Penalty Reduction Technique: Victim Caches Figure: placement of the victim cache in the memory hierarchy. Although it reduces the miss penalty, the victim cache is aimed at reducing the damage done by conflict misses. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses. 84

85 4- Reducing Miss Rate 85

86 Reducing Miss Rate Compulsory: also called cold start misses or first reference misses (misses in even an infinite cache). Capacity: misses in a fully associative cache of size X. Conflict: also called collision misses or interference misses (misses in an N-way associative, size X cache). More recently, a 4th C, Coherence: misses caused by cache coherence. 86

87 Miss rate components (relative percent; the three components sum to 100% of the total miss rate), tabulated for cache sizes from 4 KB to 512 KB and degrees of associativity of 1-, 2-, 4-, and 8-way: each row gives the total miss rate and its compulsory, capacity, and conflict shares. 87

88 Reducing Miss Rate Figure: 3Cs absolute miss rate (SPEC92): miss rate per type versus cache size (KB) for 1-way, 2-way, 4-way, and 8-way caches, with the capacity and compulsory components marked. 88

89 Reducing Miss Rate 2:1 Cache Rule: miss rate of a 1-way associative (direct-mapped) cache of size X ≈ miss rate of a 2-way set-associative cache of size X/2. Figure: miss rate versus cache size (KB) for 1-way, 2-way, 4-way, and 8-way caches, with capacity and compulsory components. 89

90 Reducing Miss Rate Figure: 3Cs relative miss rate: the same data normalized to 100% per cache size (KB), for 1-way, 2-way, 4-way, and 8-way caches, with capacity and compulsory components. 90

91 Reducing Miss Rate: Larger Block Size Miss rates versus block size and cache size:
Block size   4K      16K     64K     256K
16           8.57%   3.94%   2.04%   1.09%
32           7.24%   2.87%   1.35%   0.70%
64           7.00%   2.64%   1.06%   0.51%
128          7.78%   2.77%   1.02%   0.49%
256          9.51%   3.29%   1.15%   0.49%
91

92 Reducing Miss Rate: Larger Block Size Table: average memory access time versus block size (16-256 bytes) and cache size (1K-256K), with the miss penalty per block size. Assume 80 clock cycles of overhead, with 8 bytes then delivered every cycle (e.g., an 82-cycle penalty for a 16-byte block). 92

93 Reducing Miss Rate: Larger Block Size Figure: miss rate (0-25%) versus block size (bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. 93

94 Reducing Miss Rate: Larger Caches Figure: miss rate versus cache size (KB). 94

95 Reducing Miss Rate: Higher Associativity 95

96 Reducing Miss Rate: Higher Associativity FIGURE 5.19 Average memory access time, using the miss rates in Figure 5.14 for the parameters in the example, tabulated by cache size (KB) and associativity (one-way, two-way, four-way, eight-way). Red means that this time is higher than the number to its left; that is, higher associativity increases average memory access time. Clock cycle time(2-way) = 1.36 x Clock cycle time(1-way); Clock cycle time(4-way) = 1.44 x Clock cycle time(1-way); Clock cycle time(8-way) = 1.52 x Clock cycle time(1-way). Average memory access time(8-way) = Hit time(8-way) + Miss rate(8-way) x Miss penalty(8-way). 96

97 Reducing Miss Rate: Pseudo-Associative Caches How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit). Access times order as: hit time < pseudo-hit time < miss penalty. Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles. Better for caches not tied directly to the processor (L2). Used in the MIPS R10000 L2 cache; similar in the UltraSPARC. 97

98 Reducing Miss Rate: Pseudo-Associative Caches EXAMPLE: If the block is not found on the regular (first) probe, the alternate probe needs 2 extra clock cycles. Which one is better: DIRECT, 2-WAY, or PSEUDO? ANSWER: T_avg(pseudo) = t_hit(pseudo) + Miss rate(pseudo) x Miss penalty(pseudo). Miss rate(pseudo) = Miss rate(2-way); Miss penalty(pseudo) = Miss penalty(1-way). t_hit(pseudo) = t_hit(1-way) + Alternate hit rate(pseudo) x 2, where Alternate hit rate(pseudo) = hit rate(2-way) - hit rate(1-way) = Miss rate(1-way) - Miss rate(2-way). Therefore T_avg(pseudo) = t_hit(1-way) + (Miss rate(1-way) - Miss rate(2-way)) x 2 + Miss rate(2-way) x Miss penalty(1-way). 98

99 Reducing Miss Rate: Compiler Optimizations McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software. Instructions: reorder procedures in memory so as to reduce conflict misses; use profiling to look at conflicts (with tools they developed). Data: Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays. Loop Interchange: change the nesting of loops to access data in the order stored in memory. Loop Fusion: combine 2 independent loops that have the same looping and some variables in common. Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows. 99

100 Reducing Miss Rate: Compiler Optimizations Merging Arrays Example Figure: a matrix with rows 0..5000 and columns 0..100, stored in row-major order. The original code would skip through memory in strides of 100 words, while the revised version accesses all the words in one cache block before going to the next block. 100
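The code pair being described matches the classic loop-interchange example from Hennessy & Patterson, reproduced here as a sketch (x is the 5000 x 100 array shown in the figure):

/* Before: x is traversed column by column, striding through
   memory 100 words at a time (row-major storage). */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After: interchanging the loops walks each row sequentially,
   using every word of a cache block before moving on. */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];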

101 Reducing Miss Rate: Compiler Optimizations (loop fusion)
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
101

102 Reducing Miss Rate: Compiler Optimizations (Blocking) Storing the arrays in row major order or in column major order does not solve the problem because both rows and columns are used in every iteration of the loop. Loop interchange efforts are not helpful either. 102

103 Reducing Miss Rate: Compiler Optimizations (Blocking) Sudhakar Yalamanchili, Georgia Institute of Technology 103

104 Reducing Miss Rate: Compiler Optimizations (Blocking) 104
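For concreteness, the canonical blocked matrix-multiply kernel from Hennessy & Patterson, with blocking factor B, looks like this (min(a,b) is shorthand for the smaller of its arguments, e.g. #define min(a,b) ((a) < (b) ? (a) : (b))):

/* Blocked matrix multiply x += y*z: operate on B x B submatrices
   so the blocks of y and z stay resident in the cache while reused. */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }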

105 105

106 5- Reducing Miss Penalty or Miss Rate via Parallelism 106

107 Nonblocking Caches Figure: percentage of the average memory stall time, by benchmark. 107

108 Prefetch Hardware Prefetching of Instructions and Data. Compiler-Controlled Prefetching: an alternative to hardware prefetching is for the compiler to insert prefetch instructions to request the data before they are needed. 108
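As a small illustration of compiler-controlled prefetching (my sketch, not from the slides), using the GCC/Clang __builtin_prefetch intrinsic to request array data a few iterations ahead:

/* Sum an array, prefetching PREFETCH_AHEAD iterations in advance.
   __builtin_prefetch is a GCC/Clang builtin; the distance 16 is an
   illustrative tuning parameter, not a universal constant. */
#define PREFETCH_AHEAD 16

double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /* read */, 1);
        s += a[i];
    }
    return s;
}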

109 109

110 Main Memory and Organizations for Improving Performance 110

111 Main memory satisfies the demands of caches and serves as the I/O interface, as it is the destination of input as well as the source for output. Performance measures of main memory emphasize both latency and bandwidth. (Memory bandwidth is the number of bytes read or written per unit time.) -Latency concerns cache -Bandwidth concerns I/O and multiprocessors. 111

112 Assume the performance of the basic memory organization is: 4 clock cycles to send the address; 56 clock cycles for the access time per word; 4 clock cycles to send a word of data. With a cache block of four words and an 8-byte word, the miss penalty = 4 x (4 + 56 + 4) = 256 clock cycles, and the memory bandwidth = 1/8 byte per clock cycle (4 x 8 bytes / 256 cycles). These values are our default case. 112

113 Higher memory bandwidth 113

114 Higher memory bandwidth First Technique for Higher Bandwidth: Wider Main Memory With a main memory width of two words, the miss penalty in our example would drop from 4 x 64 = 256 clock cycles, as calculated above, to 2 x 64 = 128 clock cycles. There is a cost in the wider BUS, and CPUs will still access the cache a word at a time, so there now needs to be a multiplexer between the cache and the CPU. 114

115 Higher memory bandwidth Simple Interleaved Memory EXAMPLE: Block size = 1 word, memory bus width = 1 word, miss rate = 3%, memory accesses per instruction = 1.2, cache miss penalty = 64 cycles, average CPI (ignoring cache misses) = 2. A block size of 2 words gives a miss rate of 2%; a block size of 4 words gives a miss rate of 1.2%. Which is better: interleaving 2-way, interleaving 4-way, or doubling the width of memory and the bus? (Access times: 4, 56, 4.) 115

116 Higher memory bandwidth ANSWER 1-word blocks: CPI = 2 + (1.2 x 3% x 64) = 4.30.
2-word blocks: 64-bit bus, no interleaving: CPI = 2 + (1.2 x 2% x 2 x 64) = 5.07. 64-bit bus, interleaving: CPI = 2 + (1.2 x 2% x (4 + 56 + 2 x 4)) = 3.63. 128-bit bus, no interleaving: CPI = 2 + (1.2 x 2% x 1 x 64) = 3.54.
4-word blocks: 64-bit bus, no interleaving: CPI = 2 + (1.2 x 1.2% x 4 x 64) = 5.69. 64-bit bus, interleaving: CPI = 2 + (1.2 x 1.2% x (4 + 56 + 4 x 4)) = 3.09. 128-bit bus, no interleaving: CPI = 2 + (1.2 x 1.2% x 2 x 64) = 3.84. (64-bit words.) 116

117 Higher memory bandwidth Third Technique for Higher Bandwidth: Independent Memory Banks A generalization of interleaving is to allow multiple independent accesses, where multiple memory controllers allow banks (or sets of word-interleaved banks) to operate independently. Each bank needs separate address lines and possibly a separate data bus. For example, an input device may use one controller and one bank, the cache read may use another, and a cache write may use a third. 117

118 Virtual Memory 118

119 Why Virtual Memory Demand paging: Using physical memory efficiently Memory management: Using physical memory simply Protection: Using physical memory safely 119

120 Virtual Memory Management Virtual Address Space for Process 1: Virtual Address Space for Process 2: 120

121 Virtual Memory Running multiple processes, each with its own address space. Virtual memory divides physical memory into blocks and allocates them to different processes. Protection restricts a process to its own blocks. Before VM: if a program became too large for physical memory, the programmer divided it into pieces, identified the pieces that were mutually exclusive, and loaded or unloaded these overlays under user program control. 121

122 Virtual Memory The logical program in its contiguous virtual address space is shown on the left. It consists of four pages A, B, C, and D. The actual location of three of the blocks is in physical main memory and the other is located on the disk. 122

123 Examples of systems with only physical memory: 1- early PCs; 2- almost all embedded systems. 123

124 Caches vs. virtual memory Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the OS. The longer miss penalty means it's more important to make a good decision, so the OS can be involved and take time deciding what to replace. The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size. In addition to acting as the lower-level backing store for main memory in the hierarchy, secondary storage is also used for the file system. In fact, the file system occupies most of secondary storage. It is not normally in the address space. 124

125 Typical ranges of parameters 125

126 Example of how paging and segmentation divide a program. 126

127 Paging versus segmentation. 127

128 4Q s for virtual memory Q1: Where can a block be placed in main memory? Q2: How is a block found if it is in main memory? Q3: Which block should be replaced on a virtual memory miss? Q4: What happens on a write? 128

129 Q1: Where can a block be placed in main memory? The miss penalty for virtual memory involves access to a rotating magnetic storage device and is therefore quite high. Given the choice of lower miss rates or a simpler placement algorithm, operating systems designers normally pick lower miss rates because of the exorbitant miss penalty. Thus, operating systems allow blocks to be placed anywhere in main memory. According to the cache terminology this strategy would be labeled fully associative. 129

130 Q2: How is a block found if it is in main memory? Given a 32-bit virtual address, 4-KB pages, and 4 bytes per page table entry, the size of the page table would be (2^32 / 2^12) x 2^2 = 2^22 bytes, or 4 MB. 130

131 Q3: Which block should be replaced on a virtual memory miss? (Page faults vs. cache misses.) Almost all operating systems try to replace the least-recently used (LRU) block. Q4: What happens on a write? The level below main memory contains rotating magnetic disks that take millions of clock cycles to access. Thus, the write strategy is always write back. 131

132 Techniques for Fast Address Translation 132

133 Selecting a Page Size The size of the page table is inversely proportional to the page size; memory (or other resources used for the memory map) can therefore be saved by making the pages bigger. A larger page size can allow larger caches with fast cache hit times. Transferring larger pages to or from secondary storage, possibly over a network, is more efficient than transferring smaller pages. The number of TLB entries is restricted, so a larger page size means that more memory can be mapped efficiently, thereby reducing the number of TLB misses. 133

134 Servicing a Page Fault The processor communicates with the controller: read a block of length P starting at disk address X and store it starting at memory address Y. The read occurs via Direct Memory Access (DMA), done by the I/O controller. The controller signals completion with an interrupt; the processor invokes the OS. The OS eventually resumes the suspended process. 134

135 A Typical Memory Hierarchy Split instruction & data primary caches (on-chip SRAM) Multiple interleaved memory banks (off-chip DRAM) CPU RF L1 Instruction Cache L1 Data Cache Unified L2 Cache Memory Memory Memory Memory Multiported register file (part of CPU) Large unified secondary cache (on-chip SRAM) 135

136 AMD Opteron caches and TLBs. 136

137 Protection via Virtual Memory 137

138 Meaning of Protection by Virtual Memory A normal user process should not be able to: Read/write another process memory Write into shared library data Role of virtual memory Address space isolation Protection information in page table Efficient clearing of data on newly allocated pages 138

139 Protecting Processes The simplest protection mechanism is a pair of registers that checks every address to be sure that it falls between the two limits, traditionally called base and bound. An address is valid if Base Address Bound 139

140 The computer designer has to help the operating system designer protect processes from each other by: 1. Provide at least two modes, indicating whether the running process is a user process or an operating system process ( called a kernel process, a supervisor process, or an executive process). 2. Provide a portion of the CPU state that a user process can use but not write. Including the base/bound registers, a user/supervisor mode bit(s), and the exception enable/disable bit. 3. Provide mechanisms whereby the CPU can go from user mode to supervisor mode and vice versa. The first direction is typically accomplished by a system call, implemented as a special instruction. The return to user mode is like a subroutine return that restores the previous user/supervisor mode. Protecting Processes 140

141 Protection via Virtual Machines The idea of virtual machine is old but with the popularity of multiprocessors it is gaining more attention because: the increasing importance of isolation and security in modern systems, the failures in security and reliability of standard operating systems, the sharing of a single computer among many unrelated users, and the dramatic increases in raw speed of processors, which makes the overhead of VMs more acceptable. 141

142 Protection via Virtual Machines A single computer runs multiple VMs and can support a number of different operating systems (OSes). A virtual machine monitor (VMM) or hypervisor is the heart of Virtual Machine and supports VMs. Hardware platform is host. Resources are shared by guest VMs. A physical resource may be time-shared, partitioned, or even emulated in software. 142

143 x86 CPU hardware actually provides four protection rings: 0, 1, 2, and 3. Only rings 0 (Kernel) and 3 (User) are typically used. 143

144 Kernel-User 1.Kernel Mode In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC. 2. User Mode In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode. 144

145 Hypervisor mode Under hypervisor virtualization, a program known as a hypervisor (also known as a type 1 Virtual Machine Monitor or VMM) runs directly on the hardware of the host system in ring 0. The task of this hypervisor is to handle resource and memory allocation for the virtual machines in addition to providing interfaces for higher level administration and monitoring tools. 145

146 Para-virtualization Under Para-virtualization the kernel of the guest operating system is modified specifically to run on the hypervisor. This typically involves replacing any privileged operations that will only run in ring 0 of the CPU with calls to the hypervisor (known as hypercalls). The hypervisor in turn performs the task on behalf of the guest kernel. 146

147 Full Virtualization Full virtualization provides support for unmodified guest operating systems. The term unmodified refers to operating system kernels which have not been altered to run on a hypervisor and therefore still execute privileged operations as though running in ring 0 of the CPU. 147

148 VMs also provide: 1. Managing software: VMs provide an abstraction that can run the complete software stack, even including old operating systems like DOS. A typical deployment might be some VMs running legacy OSes, many running the current stable OS release, and a few testing the next OS release. 2. Managing hardware: One reason for multiple servers is to have each application running with the compatible version of the operating system on separate computers, as this separation can improve dependability. VMs allow these separate software stacks to run independently yet share hardware, thereby consolidating the number of servers. Another example is that some VMMs support migration of a running VM to a different computer, either to balance load or to evacuate from failing hardware. 148

149 Requirements of a Virtual Machine Monitor VMM presents a software interface to guest software. It must isolate the state of guests from each other, and it must protect itself from guest software (including guest OSes). To virtualize the processor, the VMM must control access to privileged state, address translation, I/O, exceptions and interrupts. VMM, just like paged virtual memory, must have: At least two processor modes, system and user. A privileged subset of instructions that is available only in system mode, resulting in a trap if executed in user mode. All system resources must be controllable only via these instructions. 149

150 Guest OS and VMM: guest virtual memory maps to guest real memory, which maps to physical memory. In principle, the guest OS maps virtual memory to real memory via its page tables, and the VMM page tables map the guest's real memory to physical memory. Rather than pay an extra level of indirection on every memory access, the VMM maintains a shadow page table that maps directly from the guest virtual address space to the physical address space of the hardware. 150

151 Virtualization Products VMware: the major software of the field; provides hardware-emulation virtualization products called VMware Server and ESX Server. Xen: an open source contender; provides a para-virtualization solution and comes bundled with most Linux distributions. XenSource: the commercial sponsor of Xen; provides products that are commercial extensions of Xen focused on Windows virtualization; XenSource was recently acquired by Citrix. OpenVZ: an open source product providing operating system virtualization; available for both Windows and Linux. SWsoft: the commercial sponsor of OpenVZ; provides a commercial version of OpenVZ called Virtuozzo. OpenSolaris: the open source version of Sun's Solaris operating system provides operating system virtualization. 151

152 An Example VMM: The Xen Virtual Machine Performance relative to native Linux 152

153 An Example VMM: The Xen Virtual Machine Figure: receive throughput (Mbits/sec) versus number of network interface cards. 153

154 An Example VMM: The Xen Virtual Machine 154


More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 9: Memory Part I CSC 631: High-Performance Computer Architecture 1 Introduction Programmers want unlimited amounts of memory with low

More information

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)

CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson) CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson) Memory Hierarchy We need huge amount of cheap and fast memory Memory is either fast or cheap; never both. Do as politicians do: fake it Give

More information

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

MEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming

MEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming MEMORY HIERARCHY DESIGN B649 Parallel Architectures and Programming Basic Optimizations Average memory access time = Hit time + Miss rate Miss penalty Larger block size to reduce miss rate Larger caches

More information

Improving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs

Improving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Misses Classifying Misses: 3 Cs! Compulsory The first access to a block is

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 11: Memory

CS252 Spring 2017 Graduate Computer Architecture. Lecture 11: Memory CS252 Spring 2017 Graduate Computer Architecture Lecture 11: Memory Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Logistics for the 15-min meeting next Tuesday Email

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design Advanced Computer Architecture- 06CS81-Memory Hierarchy Design AMAT and Processor Performance AMAT = Average Memory Access Time Miss-oriented Approach to Memory Access CPIExec includes ALU and Memory instructions

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

CS/ECE 3330 Computer Architecture. Chapter 5 Memory

CS/ECE 3330 Computer Architecture. Chapter 5 Memory CS/ECE 3330 Computer Architecture Chapter 5 Memory Last Chapter n Focused exclusively on processor itself n Made a lot of simplifying assumptions IF ID EX MEM WB n Reality: The Memory Wall 10 6 Relative

More information

CMSC 611: Advanced Computer Architecture. Cache and Memory

CMSC 611: Advanced Computer Architecture. Cache and Memory CMSC 611: Advanced Computer Architecture Cache and Memory Classification of Cache Misses Compulsory The first access to a block is never in the cache. Also called cold start misses or first reference misses.

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B. Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.5) Memory Technologies Dynamic Random Access Memory (DRAM) Optimized

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory. Last =me in Lecture 5

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory. Last =me in Lecture 5 CS 152 Computer Architecture and Engineering Lecture 6 - Memory Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs152!

More information

Lecture 11. Virtual Memory Review: Memory Hierarchy

Lecture 11. Virtual Memory Review: Memory Hierarchy Lecture 11 Virtual Memory Review: Memory Hierarchy 1 Administration Homework 4 -Due 12/21 HW 4 Use your favorite language to write a cache simulator. Input: address trace, cache size, block size, associativity

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Different Storage Memories Chapter 5 Large and Fast: Exploiting Memory

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COEN-4710 Computer Hardware Lecture 7 Large and Fast: Exploiting Memory Hierarchy (Chapter 5) Cristinel Ababei Marquette University Department

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information