Memory Hierarchy Design


1 Advanced Computer Architecture Memory Hierarchy Design

2 Some slides are from the instructors' resources that accompany the 6th and previous editions of the textbook. Some slides are from David Patterson, David Culler, and Krste Asanovic of UC Berkeley; Israel Koren of UMass Amherst; and Milos Prvulovic of Georgia Tech. Otherwise, the source of the slide is mentioned at the bottom of the page. Please send an email if a name is missing in the above list.

3 Introduction 3

4 Programmers want unlimited amounts of memory with low latency. Fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy. The entire addressable memory space is available in the largest, slowest memory. Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor. Temporal and spatial locality ensures that nearly all references can be found in the smaller memories. This gives the illusion of a large, fast memory being presented to the processor. 4

5 Memory hierarchy: personal mobile device (PMD) 5

6 Memory hierarchy: laptop or desktop 6

7 Memory hierarchy: server 7

8 Processor-Memory Performance Gap Figure: performance versus time (year), following Moore's Law; µproc improves ~60%/year while DRAM improves ~7%/year, so the gap widens. 8

9 Memory Hierarchy Design Memory hierarchy design becomes more crucial with recent multi-core processors. Aggregate peak bandwidth grows with # cores: the Intel Core i7 can generate two references per core per clock. With four cores and a 4.2 GHz clock: 32.8 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s! DRAM bandwidth is only 8% of this (34.1 GB/s). Requires: multi-port, pipelined caches; two levels of cache per core; a shared third-level cache on chip. 9

10 Memory hierarchy in embedded computers The memory hierarchy of embedded computers differs from that of desktops: 1- They are used in real-time applications; caches improve average-case performance but can degrade worst-case performance. 2- They are concerned about power and battery life, so hardware-intensive memory hierarchy performance techniques are less interesting. 3- Embedded systems typically run only one application, so the protection role of the memory hierarchy is often diminished. 4- The main memory itself may be quite small, with no disk storage. 10

11 Memory Technology 11

12 Core Memory Core memory was the first large-scale reliable main memory, invented by Forrester in the late 40s/early 50s at MIT for the Whirlwind project. Bits were stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires. Coincident current pulses on the X and Y wires would write a cell and also sense its original state (destructive reads). Robust, non-volatile storage. Cores were threaded onto wires by hand (25 billion a year at peak production). Core access time ~ 1 µs. 12

13 Naive Register File 13

14 Memory Arrays: Register File 14

15 Memory Arrays: SRAM 15

16 Memory Arrays: DRAM 16

17 Semiconductor Memory, DRAM Semiconductor memory began to be competitive in the early 1970s; Intel was formed to exploit the market for semiconductor memory. The first commercial DRAM was the Intel 1103: 1 Kbit of storage on a single chip, with charge on a capacitor used to hold each value. Semiconductor memory quickly replaced core in the 70s. 17

18 One Transistor Dynamic RAM Figure: 1-T DRAM cell, with word line (poly, W), bit line, access transistor, and storage capacitor (FET gate, trench, or stack; e.g., TiN top electrode at V_REF, Ta2O5 dielectric, W bottom electrode). 18

19 DRAM Architecture Figure: a 2^N x 2^M array of one-bit memory cells at the intersections of word lines and bit lines, with an N-bit row address decoder and an M-bit column decoder plus sense amplifiers delivering data D. Bits are stored in 2-dimensional arrays on chip. Modern chips have around 4 logical banks on each chip, each logical bank physically implemented as many smaller arrays. 19

20 DRAM Packaging Figure: DRAM chip with clock and control signals (~7), multiplexed row/column address lines (~12), and a data bus (4b, 8b, 16b, 32b). A DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes buffers are needed to drive the signals to all chips). The data pins work together to return a wide word (e.g., a 64-bit data bus using 16 x4-bit parts). 20

21 DRAM Operation Three steps in a read/write access to a given bank:
Row access (RAS): decode the row address and enable the addressed row (often multiple Kb in a row); the bitlines share charge with the storage cells; the small change in voltage is detected by sense amplifiers, which latch the whole row of bits; the sense amplifiers then drive the bitlines full rail to recharge the storage cells.
Column access (CAS): decode the column address to select a small number of sense amplifier latches (4, 8, 16, or 32 bits depending on the DRAM package); on a read, send the latched bits out to the chip pins; on a write, change the sense amplifier latches, which then charge the storage cells to the required value. Multiple column accesses can be performed on the same row without another row access (burst mode).
Precharge: charges the bit lines to a known value; required before the next row access.
Each step has a latency of around 15-20 ns in modern DRAMs. Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share the same core architecture. 21

22 Capacity and access times for DDR SDRAMs by year of production. Access time is for a random memory word and assumes a new row must be opened. If the row is in a different bank, we assume the bank is precharged; if the row is not open, then a precharge is required, and the access time is longer. 22

23 Improving bandwidth 1. Timing signals that allow repeated accesses to the row buffer without another row access time, typically called fast page mode. Such a buffer comes naturally, as each array buffers 1024 to 4096 bits for each access. 2. Add a clock signal to the DRAM interface, so that the repeated transfers do not bear that synchronization overhead: synchronous DRAM (SDRAM). SDRAMs typically also have a programmable register to hold the number of bytes requested, and hence can send many bytes over several cycles per request. 3. Transfer data on both the rising edge and falling edge of the DRAM clock signal, thereby doubling the peak data rate. This optimization is called double data rate (DDR). 23

24 Labeling DDR DIMM A DDR DIMM is connected to a 133 MHz bus. Why is it called PC2100? Its transfer rate is: 133 MHz x 2 (DDR) x 8 bytes ≈ 2100 MB/sec. 24

25 Labeling DDR DIMM Example Suppose a new DDR3 DIMM is transferring data at 16000 MB/sec. What should it be named? Answer The DIMM name should be PC16000. The clock rate of the DIMM: Clock rate x 2 x 8 = 16000, so Clock rate = 16000/16 = 1000 MHz. (Photo: a CL5 1066 MHz DDR2 DIMM.) 25
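To make the naming rule concrete, here is a small C sketch (not from the original slides; the function name and the fixed 8-byte bus width are illustrative assumptions) that computes the PC rating from the bus clock:

#include <stdio.h>

/* PC rating = clock (MHz) x 2 transfers/clock (DDR) x bus width in bytes.
   Assumes the common 64-bit (8-byte) DIMM data bus. */
static unsigned pc_rating(unsigned clock_mhz) {
    return clock_mhz * 2 * 8;   /* MB/sec */
}

int main(void) {
    printf("133 MHz DDR   -> PC%u\n", pc_rating(133));  /* 2128; marketed (rounded) as PC2100 */
    printf("1000 MHz DDR3 -> PC%u\n", pc_rating(1000)); /* PC16000 */
    return 0;
}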

26 Memory Dependability Memory is susceptible to cosmic rays. Soft errors: dynamic errors, detected and fixed by error correcting codes (ECC). Hard errors: permanent errors; use spare rows to replace defective rows. Chipkill: a RAID-like error recovery technique. 26

27 HBM: Stacked or Embedded DRAMs Placing multiple DRAMs in a stacked or adjacent fashion, embedded within the same package as the processor, possibly with the DRAM die directly on the CPU die. Also called high bandwidth memory (HBM). The 2.5D form is available now; 3D stacking is under development and faces heat-management challenges due to the CPU. 27

28 Chipkill Chipkill was introduced by IBM to solve the problem of memory chip failure. Similar in nature to the RAID approach used for disks, Chipkill distributes the data and ECC information so that the complete failure of a single memory chip can be handled by reconstructing the missing data from the remaining memory chips. An analysis by IBM, assuming a 10,000-processor server with 4 GB per processor, yields the following rates of unrecoverable errors in 3 years of operation: 1. Parity only: about 90,000, or one unrecoverable (or undetected) failure every 17 minutes. 2. ECC only: about 3500, or about one undetected or unrecoverable failure every 7.5 hours. 3. Chipkill: 6, or about one undetected or unrecoverable failure every 2 months. 28

29 Review of the ABCs of Caches 29

30 1) cache hit 2) cache miss 3) block 4) virtual memory 5) page 6) page fault 7) memory stall cycles 8) miss penalty 9) miss rate 10) address trace 11) direct mapped 12) fully associative 13) n-way set associative 14) valid bit 15) dirty bit 16) least-recently used 17) random replacement 18) block address 19) block offset 20) tag field 21) index 22) write through 23) write back 24) write allocate 25) no-write allocate 26) instruction cache 27) data cache 28) unified cache 29) write buffer 30) average memory access time 31) hit time 32) misses per instruction 33) write stall 30

31 Miss-oriented approach to memory access 31
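For reference, the standard miss-oriented formulation of CPU time (as given in Hennessy & Patterson) is:

CPU time = IC x (CPI_execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

Here the memory-stall component is folded into CPI as misses per instruction times the miss penalty.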

32 Example Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits? (Figure: ideal cache vs. real cache.) 32
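Working the arithmetic from the values given (this computation is implied by the slide's ideal-vs-real comparison): memory accesses per instruction = 1 instruction fetch + 0.5 data accesses = 1.5. CPI(real) = 1.0 + 1.5 x 2% x 25 = 1.75, versus CPI(ideal) = 1.0, so the all-hit computer would be 1.75 times faster.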

33 Four Memory Hierarchy Questions For the first level of the memory hierarchy: Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy) 33

34 Locality of Reference Consider the string of references in a typical program 1. Temporal Locality 2. Spatial Locality 3. Sequential Locality 34

35 Common And Predictable Memory Reference Patterns 35

36 Memory Reference Patterns Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971). 36

37 Prefetch Policy 1. Prefetch on a miss: a miss on block i triggers a fetch of block i+1. 2. Tagged prefetch: in addition, the first reference to a previously prefetched block triggers a fetch of the next block. (Figure: BLOCK i, BLOCK i+1, BLOCK i+2.) 37

38 Example: cache has 8 block frames and memory has 32 blocks. 38

39 Direct-Mapped Cache Figure: the address is split into tag (t bits), index (k bits), and block offset (b bits); the index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared (=) with the address tag to produce Hit, and the offset selects the data word or byte. (Slide: Krste Asanovic, University of California at Berkeley) 39

40 2-Way Set-Associative Cache Figure: the index selects one set; the two ways each hold a valid bit, tag, and data block; two comparators check the address tag against both stored tags in parallel, a multiplexer selects the matching way's data word or byte, and the OR of the comparisons produces Hit. (Slide: Krste Asanovic, University of California at Berkeley) 40

41 Fully Associative Cache Figure: there is no index; every line's tag is compared (=) against the address tag in parallel, any match asserts HIT and selects that line's data block, and the block offset (b bits) picks the data word or byte. (Slide: Krste Asanovic, University of California at Berkeley) 41

42 Disadvantage of Set Associative Cache N-way set associative cache versus direct mapped cache: N comparators vs. 1; extra MUX delay for the data; data comes AFTER the Hit/Miss decision and set selection. In a direct mapped cache, the cache block is available BEFORE Hit/Miss: it is possible to assume a hit and continue, and recover later if it was a miss. 42

43 Next and Last references d_t(x): forward distance of block x at time t, i.e., the number of references from t until the next reference to x (d_t(x) = k if the next reference to x is the k-th reference after t; infinite if x is never referenced again). b_t(x): backward distance of block x at time t, i.e., the number of references since the last reference to x. 43

44 Replacement Policies Z(t): existing set of blocks in the cache. Q(Z(t)): the block that is to be replaced. 1. LRU: Q(Z(t)) = y iff b_t(y) = max{ b_t(x) : x in Z(t) }, i.e., replace the block with the largest backward distance. 2. MIN: Q(Z(t)) = y iff d_t(y) = max{ d_t(x) : x in Z(t) }, i.e., replace the block with the largest forward distance (optimal, but requires future knowledge). 3. LFU 4. FIFO 5. CLOCK 6. LIFO 7. RAND 44
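As a concrete illustration of the LRU rule above, a minimal C sketch (mine, not the slides'; the set contents and timestamps are made up for the example) that picks the victim with the largest backward distance by tracking last-reference times:

#include <stdio.h>

#define WAYS 4

/* One cache set: block ids and the "time" each was last referenced. */
static int block[WAYS]     = {10, 20, 30, 40};
static int last_used[WAYS] = { 5,  9,  2,  7};

/* LRU victim = block with the largest backward distance,
   i.e., the smallest last-reference time. */
static int lru_victim(int now) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    printf("victim: block %d (backward distance %d)\n",
           block[victim], now - last_used[victim]);
    return victim;
}

int main(void) {
    lru_victim(12);  /* replaces block 30, untouched since time 2 */
    return 0;
}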

45 Clock Replacement Policy Initial values in the LRU table. Block 1 is referenced. The sequence of references is: 1,2,3,4,5,6,7,8,1; after that a replacement is to be made. 45

46 Clock Replacement Policy Block 2 is referenced Block 3 is referenced

47 Clock Replacement Policy Block 4 is referenced Block 5 is referenced

48 Clock Replacement Policy Block 6 is referenced Block 7 is referenced

49 Clock Replacement Policy Block 8 is referenced. Block 1 is referenced. The block with all 0s in its row is to be replaced: block 2. Block 1 was initially the candidate due to the FIFO strategy, but we gave Block 1 a second chance since it has been recently used.

50 Effect of different replacement policies Data cache misses per 1000 instructions 50

51 Cache Update Policies Write through (WT): WTWA (write-through, write-allocate) and WTNWA (write-through, no-write-allocate). tc = cache cycle time, tm = memory cycle time, tb = block transfer time, wt = write ratio. The average time to complete a reference under 1- WTWA and 2- WTNWA is expressed in terms of these parameters (see the sketch after the next slide). 51

52 Cache Update Policies Write Back (WB): simple write back (SWB) and flagged write back (FWB, using a dirty bit). The average reference time for SWB and FWB is likewise expressed in terms of the parameters above. 52
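One common textbook-style formulation of these average reference times, stated here as an assumption rather than as the slides' own equations (h = hit ratio, wd = fraction of replaced blocks that are dirty, other parameters as defined above):

WTWA:  T = tc + (1 - h) x tb + wt x (tm - tc)
WTNWA: T = tc + (1 - h) x (1 - wt) x tb + wt x (tm - tc)
SWB:   T = tc + (1 - h) x 2 x tb        (every miss writes back the victim block and fetches the new one)
FWB:   T = tc + (1 - h) x (1 + wd) x tb (only dirty victims are written back)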

53 Intel Core i7-965 XE & Core i

54 CINEBENCH 9.5 CPU TEST 7

55 CineBench 11.5 Score (Higher is Better) CINEBENCH is a performance suite that utilizes CINEMA 4D for both CPU- and video-based testing. CINEMA 4D is a 3D modeling, animation, motion graphics, and rendering application developed by MAXON Computer GmbH in Germany.

56 Misses per 1000 instructions:
Size     Instruction cache   Data cache   Unified cache
8 KB     8.16                44.0         63.0
16 KB    3.82                40.9         51.0
32 KB    1.36                38.4         43.3
64 KB    0.61                36.9         39.4
128 KB   0.30                35.3         36.2
256 KB   0.02                32.6         32.9

57 Example: Harvard Architecture Proc with a unified 32 KB L1 cache, versus Proc with a 16 KB L1 I-cache and a 16 KB L1 D-cache. Which is better? Assume 36% of instructions are data transfers; 74% of accesses are instruction references; hit time = 1, miss penalty = 100. Note that a data hit has 1 extra stall for the unified cache (only one port). Miss rate(16 KB instruction) = 3.82/1000/1.0 = 0.0038 per access. Miss rate(16 KB data) = 40.9/1000/0.36 = 0.1136 per access. Miss rate(32 KB unified) = 43.3/1000/1.36 = 0.0318 per access. 57

58 Example: Harvard Architecture Average Memory Access Time = AMAT. AMAT(split) = 74% x (1 + 0.0038 x 100) + 26% x (1 + 0.1136 x 100) = 4.24. AMAT(unified) = 74% x (1 + 0.0318 x 100) + 26% x (1 + 1 + 0.0318 x 100) = 4.44. The split cache is better. 58

59 Improving Cache Performance Average memory access time = Hit time + Miss rate x Miss penalty. How do you define miss penalty? Is it the full latency of the miss to memory, or is it just the exposed or non-overlapped latency when the processor must stall? Memory stall cycles / Instruction = (Misses / Instruction) x (Total miss latency - Overlapped miss latency). 59

60 Optimization of Cache Performance AMAT = hit time + (miss rate x miss penalty). 1. Reducing the hit time: small and simple first-level caches and way-prediction. Both techniques also generally decrease power consumption. 2. Increasing cache bandwidth: pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption. 3. Reducing the miss penalty: critical word first and merging write buffers. These optimizations have little impact on power. 4. Reducing the miss rate: compiler optimizations. Obviously any improvement at compile time improves power consumption. 5. Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily because of prefetched data that are unused. 60

61 1- Reducing Cache hit time 61

62 Small and Simple First-Level Caches to Reduce Hit Time and Power Figure: access time (ns) versus cache size and associativity. 62

63 Example Determine whether a 32 KB 4-way set associative L1 cache has a faster memory access time than a 32 KB 2-way set associative L1 cache. Assume the miss penalty to L2 is 15 times the access time for the faster L1 cache. Ignore misses beyond L2. Which has the faster average memory access time? Answer: For the 2-way cache, AMAT = Hit time + Miss rate x Miss penalty, with hit time = 1 and Miss penalty = access time to L2 = 15 x 1 = 15, so AMAT(2-way) = 1 + Miss rate(2-way) x 15. For the 4-way cache, the hit time is 1.4 times longer, so AMAT(4-way) = 1.4 + Miss rate(4-way) x 15. Since the 4-way miss rate is only slightly lower while its hit time is 40% longer, the 2-way cache has the faster average memory access time. 63

64 Energy consumption per read 64

65 Way Prediction to Reduce Hit Time Idea: the conflict miss rate decreases with higher associativity, but hit time goes up due to the more complex circuits needed to select and mux the right set member. Hence organize the cache as n-way set associative, but use a predictor to say which way to look in rather than searching the whole set; hit time is then the same as in a simple direct-mapped cache. Is it reasonable? The Alpha 21264 uses it, and MIPS used it post-R10K; both used a way-predicted 2-way set-associative model. Two benefits: power, and speed if the prediction is correct. Fast hits and slow hits become a problem when the prediction is incorrect. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way set associative cache and 80% for a four-way set associative cache, with better accuracy on I-caches than D-caches. 65

66 Way prediction: reduces the number of comparators; addressing is similar to a direct-mapped cache; reduces power and reduces hit time. Way selection: uses the way prediction and accesses the data based on the prediction alone. Overall it reduces power further, but when a mis-prediction occurs it has to access the data again and discard the incorrectly accessed data; hence, the average hit time increases. 66

67 Way selection: for a 4-way cache it increases the average access time by a factor of 1.04 for the I-cache and 1.13 for the D-cache. Average cache power consumption relative to a normal 4-way cache: 0.28 for the I-cache and 0.35 for the D-cache, i.e., Power(I-cache, way-selection) = 0.28 x Power(I-cache, normal). One significant drawback of way selection is that it makes it difficult to pipeline the cache access; however, as energy concerns have mounted, schemes that do not require powering up the entire cache make increasing sense. 67

68 Example Assume that there are half as many D-cache accesses as I-cache accesses, and that the I-cache and D-cache are responsible for 25% and 15% of the processor's power consumption in a normal four-way set associative implementation. Determine if way selection improves performance per watt based on the estimates from the preceding study. Answer For the I-cache, the savings in power is 0.25 x 0.28 = 0.07 of the total power, while for the D-cache it is 0.15 x 0.35 = 0.05, for a total savings of 0.12. The increase in cache access time is the increase in I-cache average access time plus one-half the increase in D-cache access time: Normal access time = 1 + 0.5 = 1.5. Way-selection access time = 1.04 + 0.5 x 1.13 = 1.60. This optimization is best used where power rather than performance is the key objective. 68

69 Reducing Hit Time: Avoiding Address Translation 69

70 70

71 Reducing Hit Time: Trace cache Instead of limiting the instructions in a static cache block to spatial locality, a trace cache finds a dynamic sequence of instructions including taken branches to load into a cache block. 71

72 2- Increasing Cache Bandwidth 72

73 Pipelined Access and Multi-banked Caches to Increase Bandwidth This technique simply pipelines cache access, so that the effective latency of an L1 cache hit can be multiple clock cycles, giving fast cycle time and slow hits. For example, the L1 hit pipeline for the Pentium 4 takes four clocks. This split increases the number of pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of the data: Pentium: 1 cycle; Pentium Pro through Pentium III: 2 cycles; Pentium 4 through Core i7: 4 cycles. 73

74 Pipelined Access and Multi-banked Caches to Increase Bandwidth Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block. 74

75 3- Reducing Cache Miss Penalty 75

76 1st Miss Penalty Reduction Technique: Multi-Level Caches Question: a faster cache to keep pace with the speed of CPUs, or a larger cache to overcome the widening gap between the CPU and main memory? Local miss rate: the number of misses divided by the total number of memory accesses to this cache. For the L1 cache it is Miss rate(L1), and for the L2 cache it is Miss rate(L2). Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the CPU. For the L1 cache it is still just Miss rate(L1); for the L2 cache it is Miss rate(L1) x Miss rate(L2). 76
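For reference, these rates combine into the standard two-level average memory access time (as in Hennessy & Patterson):

AMAT = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Local miss rate(L2) x Miss penalty(L2))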

77 1 st Miss Penalty Reduction Technique: Multi-Level Caches EXAMPLE Suppose that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates? The miss rate (either local or global) for the L1 cache is 40/1000 or 4%. The local miss rate for the L2 cache is 20/40 or 50%. The global miss rate of the second-level cache is 20/1000 or 2%. 77

78 Miss rates Figure: local and global miss rates of the 2nd-level cache versus L2 cache size (KB); L1 consists of two 64 KB caches. A second curve shows the global miss rate of L2 where L1 is 32 KB. 78

79 Relative execution time by second-level cache size. The two bars are for different clock-cycle counts for an L2 cache hit. The reference execution time of 1.00 is for an 8192-KB second-level cache with a one-clock-cycle latency on a second-level hit. 79

80 Multilevel inclusion: L1 data is always present in L2. Inclusion is desirable because consistency between I/O and caches (or among caches in a multiprocessor) can be determined just by checking the second-level cache. If the L2 cache is only slightly bigger than the L1 cache, multilevel exclusion may be used instead: L1 data is never found in the L2 cache. Typically, with exclusion a cache miss in L1 results in a swap of blocks between L1 and L2 instead of the replacement of an L1 block with an L2 block. 80

81 2nd Miss Penalty Reduction Technique: Critical Word First and Early Restart Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first. Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. 81

82 3 rd Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes 82

83 4th Miss Penalty Reduction Technique: Merging Write Buffer Figure: a four-entry write buffer before and after merging. Without merging, four sequential writes (Mem[100], Mem[108], Mem[116], Mem[124]) occupy four entries with one valid word each; with merging, all four words are placed in a single entry with all of its valid bits set. 83

84 5th Miss Penalty Reduction Technique: Victim Caches Figure: placement of the victim cache in the memory hierarchy. Although it reduces the miss penalty, the victim cache is aimed at reducing the damage done by conflict misses. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses. 84

85 4- Reducing Miss Rate 85

86 Reducing Miss Rate Compulsory: also called cold start misses or first reference misses (misses in even an infinite cache). Capacity: misses in a fully associative cache of size X. Conflict: also called collision misses or interference misses (misses in an N-way associative, size X cache). More recently, a 4th C, Coherence: misses caused by cache coherence. 86

87 Miss rate components (relative percent; the three components sum to 100% of the total miss rate), tabulated for cache sizes from 4 KB to 512 KB and degrees of associativity of 1-, 2-, 4-, and 8-way: each row gives the total miss rate and its compulsory, capacity, and conflict shares. 87

88 Reducing Miss Rate Figure: 3Cs absolute miss rate (SPEC92): miss rate per type versus cache size (KB) for 1-way, 2-way, 4-way, and 8-way caches, with the capacity and compulsory components marked. 88

89 Reducing Miss Rate 2:1 Cache Rule: miss rate of a 1-way associative (direct-mapped) cache of size X ≈ miss rate of a 2-way set-associative cache of size X/2. Figure: miss rate versus cache size (KB) for 1-way, 2-way, 4-way, and 8-way caches, with capacity and compulsory components. 89

90 Reducing Miss Rate Figure: 3Cs relative miss rate: the same data normalized to 100% per cache size (KB), for 1-way, 2-way, 4-way, and 8-way caches, with capacity and compulsory components. 90

91 Reducing Miss Rate: Larger Block Size Miss rates versus block size and cache size:
Block size   4K      16K     64K     256K
16           8.57%   3.94%   2.04%   1.09%
32           7.24%   2.87%   1.35%   0.70%
64           7.00%   2.64%   1.06%   0.51%
128          7.78%   2.77%   1.02%   0.49%
256          9.51%   3.29%   1.15%   0.49%
91

92 Reducing Miss Rate: Larger Block Size Table: average memory access time versus block size (16-256 bytes) and cache size (1K-256K), with the miss penalty per block size. Assume 80 clock cycles of overhead, with 8 bytes then delivered every cycle (e.g., an 82-cycle penalty for a 16-byte block). 92

93 Reducing Miss Rate: Larger Block Size Figure: miss rate (0-25%) versus block size (bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. 93

94 Reducing Miss Rate: Larger Caches Figure: miss rate versus cache size (KB). 94

95 Reducing Miss Rate: Higher Associativity 95

96 Reducing Miss Rate: Higher Associativity FIGURE 5.19 Average memory access time, using the miss rates in Figure 5.14 for the parameters in the example, tabulated by cache size (KB) and associativity (one-way, two-way, four-way, eight-way). Red means that this time is higher than the number to its left; that is, higher associativity increases average memory access time. Clock cycle time(2-way) = 1.36 x Clock cycle time(1-way); Clock cycle time(4-way) = 1.44 x Clock cycle time(1-way); Clock cycle time(8-way) = 1.52 x Clock cycle time(1-way). Average memory access time(8-way) = Hit time(8-way) + Miss rate(8-way) x Miss penalty(8-way). 96

97 Reducing Miss Rate: Pseudo-Associative Caches How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit). Access times order as: hit time < pseudo-hit time < miss penalty. Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles. Better for caches not tied directly to the processor (L2). Used in the MIPS R10000 L2 cache; similar in the UltraSPARC. 97

98 Reducing Miss Rate: Pseudo-Associative Caches EXAMPLE: If the block is not found on the regular (first) probe, the alternate probe needs 2 extra clock cycles. Which one is better: DIRECT, 2-WAY, or PSEUDO? ANSWER: T_avg(pseudo) = t_hit(pseudo) + Miss rate(pseudo) x Miss penalty(pseudo). Miss rate(pseudo) = Miss rate(2-way); Miss penalty(pseudo) = Miss penalty(1-way). t_hit(pseudo) = t_hit(1-way) + Alternate hit rate(pseudo) x 2, where Alternate hit rate(pseudo) = hit rate(2-way) - hit rate(1-way) = Miss rate(1-way) - Miss rate(2-way). Therefore T_avg(pseudo) = t_hit(1-way) + (Miss rate(1-way) - Miss rate(2-way)) x 2 + Miss rate(2-way) x Miss penalty(1-way). 98

99 Reducing Miss Rate: Compiler Optimizations McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software. Instructions: reorder procedures in memory so as to reduce conflict misses; use profiling to look at conflicts (with tools they developed). Data: Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays. Loop Interchange: change the nesting of loops to access data in the order stored in memory. Loop Fusion: combine 2 independent loops that have the same looping and some variables in common. Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows. 99

100 Reducing Miss Rate: Compiler Optimizations Merging Arrays Example Figure: a matrix with rows 0..5000 and columns 0..100, stored in row-major order. The original code would skip through memory in strides of 100 words, while the revised version accesses all the words in one cache block before going to the next block. 100
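The code pair being described matches the classic loop-interchange example from Hennessy & Patterson, reproduced here as a sketch (x is the 5000 x 100 array shown in the figure):

/* Before: x is traversed column by column, striding through
   memory 100 words at a time (row-major storage). */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After: interchanging the loops walks each row sequentially,
   using every word of a cache block before moving on. */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];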

101 Reducing Miss Rate: Compiler Optimizations (loop fusion)
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
101

102 Reducing Miss Rate: Compiler Optimizations (Blocking) Storing the arrays in row major order or in column major order does not solve the problem because both rows and columns are used in every iteration of the loop. Loop interchange efforts are not helpful either. 102

103 Reducing Miss Rate: Compiler Optimizations (Blocking) Sudhakar Yalamanchili, Georgia Institute of Technology 103

104 Reducing Miss Rate: Compiler Optimizations (Blocking) 104
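For concreteness, the canonical blocked matrix-multiply kernel from Hennessy & Patterson, with blocking factor B, looks like this (min(a,b) is shorthand for the smaller of its arguments, e.g. #define min(a,b) ((a) < (b) ? (a) : (b))):

/* Blocked matrix multiply x += y*z: operate on B x B submatrices
   so the blocks of y and z stay resident in the cache while reused. */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }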

105 105

106 5- Reducing Miss Penalty or Miss Rate via Parallelism 106

107 Nonblocking Caches Figure: percentage of the average memory stall time, by benchmark. 107

108 Prefetch Hardware Prefetching of Instructions and Data. Compiler-Controlled Prefetching: an alternative to hardware prefetching is for the compiler to insert prefetch instructions to request the data before they are needed. 108
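As a small illustration of compiler-controlled prefetching (my sketch, not from the slides), using the GCC/Clang __builtin_prefetch intrinsic to request array data a few iterations ahead:

/* Sum an array, prefetching PREFETCH_AHEAD iterations in advance.
   __builtin_prefetch is a GCC/Clang builtin; the distance 16 is an
   illustrative tuning parameter, not a universal constant. */
#define PREFETCH_AHEAD 16

double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /* read */, 1);
        s += a[i];
    }
    return s;
}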

109 109

110 Main Memory and Organizations for Improving Performance 110

111 Main memory satisfies the demands of caches and serves as the I/O interface, as it is the destination of input as well as the source for output. Performance measures of main memory emphasize both latency and bandwidth. (Memory bandwidth is the number of bytes read or written per unit time.) -Latency concerns cache -Bandwidth concerns I/O and multiprocessors. 111

112 Assume the performance of the basic memory organization is: 4 clock cycles to send the address; 56 clock cycles for the access time per word; 4 clock cycles to send a word of data. With a cache block of four words and an 8-byte word, the miss penalty = 4 x (4 + 56 + 4) = 256 clock cycles, and the memory bandwidth = 1/8 byte per clock cycle (4 x 8 bytes / 256 cycles). These values are our default case. 112

113 Higher memory bandwidth 113

114 Higher memory bandwidth First Technique for Higher Bandwidth: Wider Main Memory With a main memory width of two words, the miss penalty in our example would drop from 4 x 64 = 256 clock cycles, as calculated above, to 2 x 64 = 128 clock cycles. There is a cost in the wider BUS, and CPUs will still access the cache a word at a time, so there now needs to be a multiplexer between the cache and the CPU. 114

115 Higher memory bandwidth Simple Interleaved Memory EXAMPLE: Block size = 1 word, memory bus width = 1 word, miss rate = 3%, memory accesses per instruction = 1.2, cache miss penalty = 64 cycles, average CPI (ignoring cache misses) = 2. A block size of 2 words gives a miss rate of 2%; a block size of 4 words gives a miss rate of 1.2%. Which is better: interleaving 2-way, interleaving 4-way, or doubling the width of memory and the bus? (Access times: 4, 56, 4.) 115

116 Higher memory bandwidth ANSWER 1-word blocks: CPI = 2 + (1.2 x 3% x 64) = 4.30.
2-word blocks: 64-bit bus, no interleaving: CPI = 2 + (1.2 x 2% x 2 x 64) = 5.07. 64-bit bus, interleaving: CPI = 2 + (1.2 x 2% x (4 + 56 + 2 x 4)) = 3.63. 128-bit bus, no interleaving: CPI = 2 + (1.2 x 2% x 1 x 64) = 3.54.
4-word blocks: 64-bit bus, no interleaving: CPI = 2 + (1.2 x 1.2% x 4 x 64) = 5.69. 64-bit bus, interleaving: CPI = 2 + (1.2 x 1.2% x (4 + 56 + 4 x 4)) = 3.09. 128-bit bus, no interleaving: CPI = 2 + (1.2 x 1.2% x 2 x 64) = 3.84. (64-bit words.) 116

117 Higher memory bandwidth Third Technique for Higher Bandwidth: Independent Memory Banks A generalization of interleaving is to allow multiple independent accesses, where multiple memory controllers allow banks (or sets of word-interleaved banks) to operate independently. Each bank needs separate address lines and possibly a separate data bus. For example, an input device may use one controller and one bank, the cache read may use another, and a cache write may use a third. 117

118 Virtual Memory 118

119 Why Virtual Memory Demand paging: Using physical memory efficiently Memory management: Using physical memory simply Protection: Using physical memory safely 119

120 Virtual Memory Management Virtual Address Space for Process 1: Virtual Address Space for Process 2: 120

121 Virtual Memory Running multiple processes, each with its own address space. Virtual memory divides physical memory into blocks and allocates them to different processes. Protection restricts a process to its own blocks. Before VM: if a program became too large for physical memory, the programmer divided it into pieces, identified the pieces that were mutually exclusive, and loaded or unloaded these overlays under user program control. 121

122 Virtual Memory The logical program in its contiguous virtual address space is shown on the left. It consists of four pages A, B, C, and D. The actual location of three of the blocks is in physical main memory and the other is located on the disk. 122

123 Examples of systems with only physical memory: 1- early PCs; 2- almost all embedded systems. 123

124 Caches vs. virtual memory Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the OS. The longer miss penalty means it's more important to make a good decision, so the OS can be involved and take time deciding what to replace. The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size. In addition to acting as the lower-level backing store for main memory in the hierarchy, secondary storage is also used for the file system. In fact, the file system occupies most of secondary storage. It is not normally in the address space. 124

125 Typical ranges of parameters 125

126 Example of how paging and segmentation divide a program. 126

127 Paging versus segmentation. 127

128 4Q s for virtual memory Q1: Where can a block be placed in main memory? Q2: How is a block found if it is in main memory? Q3: Which block should be replaced on a virtual memory miss? Q4: What happens on a write? 128

129 Q1: Where can a block be placed in main memory? The miss penalty for virtual memory involves access to a rotating magnetic storage device and is therefore quite high. Given the choice of lower miss rates or a simpler placement algorithm, operating systems designers normally pick lower miss rates because of the exorbitant miss penalty. Thus, operating systems allow blocks to be placed anywhere in main memory. According to the cache terminology this strategy would be labeled fully associative. 129

130 Q2: How is a block found if it is in main memory? Given a 32-bit virtual address, 4-KB pages, and 4 bytes per page table entry, the size of the page table would be (2^32 / 2^12) x 2^2 = 2^22 bytes, or 4 MB. 130

131 Q3: Which block should be replaced on a virtual memory miss? (Page faults vs. cache misses.) Almost all operating systems try to replace the least-recently used (LRU) block. Q4: What happens on a write? The level below main memory contains rotating magnetic disks that take millions of clock cycles to access. Thus, the write strategy is always write back. 131

132 Techniques for Fast Address Translation 132

133 Selecting a Page Size The size of the page table is inversely proportional to the page size; memory (or other resources used for the memory map) can therefore be saved by making the pages bigger. A larger page size can allow larger caches with fast cache hit times. Transferring larger pages to or from secondary storage, possibly over a network, is more efficient than transferring smaller pages. The number of TLB entries is restricted, so a larger page size means that more memory can be mapped efficiently, thereby reducing the number of TLB misses. 133

134 Servicing a Page Fault The processor communicates with the controller: read a block of length P starting at disk address X and store it starting at memory address Y. The read occurs via Direct Memory Access (DMA), done by the I/O controller. The controller signals completion with an interrupt; the processor invokes the OS. The OS eventually resumes the suspended process. 134

135 A Typical Memory Hierarchy Split instruction & data primary caches (on-chip SRAM) Multiple interleaved memory banks (off-chip DRAM) CPU RF L1 Instruction Cache L1 Data Cache Unified L2 Cache Memory Memory Memory Memory Multiported register file (part of CPU) Large unified secondary cache (on-chip SRAM) 135

136 AMD Opteron caches and TLBs. 136

137 Protection via Virtual Memory 137

138 Meaning of Protection by Virtual Memory A normal user process should not be able to: Read/write another process memory Write into shared library data Role of virtual memory Address space isolation Protection information in page table Efficient clearing of data on newly allocated pages 138

139 Protecting Processes The simplest protection mechanism is a pair of registers that checks every address to be sure that it falls between the two limits, traditionally called base and bound. An address is valid if Base Address Bound 139

140 The computer designer has to help the operating system designer protect processes from each other by: 1. Provide at least two modes, indicating whether the running process is a user process or an operating system process ( called a kernel process, a supervisor process, or an executive process). 2. Provide a portion of the CPU state that a user process can use but not write. Including the base/bound registers, a user/supervisor mode bit(s), and the exception enable/disable bit. 3. Provide mechanisms whereby the CPU can go from user mode to supervisor mode and vice versa. The first direction is typically accomplished by a system call, implemented as a special instruction. The return to user mode is like a subroutine return that restores the previous user/supervisor mode. Protecting Processes 140

141 Protection via Virtual Machines The idea of virtual machine is old but with the popularity of multiprocessors it is gaining more attention because: the increasing importance of isolation and security in modern systems, the failures in security and reliability of standard operating systems, the sharing of a single computer among many unrelated users, and the dramatic increases in raw speed of processors, which makes the overhead of VMs more acceptable. 141

142 Protection via Virtual Machines A single computer runs multiple VMs and can support a number of different operating systems (OSes). A virtual machine monitor (VMM) or hypervisor is the heart of Virtual Machine and supports VMs. Hardware platform is host. Resources are shared by guest VMs. A physical resource may be time-shared, partitioned, or even emulated in software. 142

143 x86 CPU hardware actually provides four protection rings: 0, 1, 2, and 3. Only rings 0 (Kernel) and 3 (User) are typically used. 143

144 Kernel-User 1.Kernel Mode In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in kernel mode are catastrophic; they will halt the entire PC. 2. User Mode In User mode, the executing code has no ability to directly access hardware or reference memory. Code running in user mode must delegate to system APIs to access hardware or memory. Due to the protection afforded by this sort of isolation, crashes in user mode are always recoverable. Most of the code running on your computer will execute in user mode. 144

145 Hypervisor mode Under hypervisor virtualization, a program known as a hypervisor (also known as a type 1 Virtual Machine Monitor or VMM) runs directly on the hardware of the host system in ring 0. The task of this hypervisor is to handle resource and memory allocation for the virtual machines in addition to providing interfaces for higher level administration and monitoring tools. 145

146 Para-virtualization Under Para-virtualization the kernel of the guest operating system is modified specifically to run on the hypervisor. This typically involves replacing any privileged operations that will only run in ring 0 of the CPU with calls to the hypervisor (known as hypercalls). The hypervisor in turn performs the task on behalf of the guest kernel. 146

147 Full Virtualization Full virtualization provides support for unmodified guest operating systems. The term unmodified refers to operating system kernels which have not been altered to run on a hypervisor and therefore still execute privileged operations as though running in ring 0 of the CPU. 147

148 VMs also provide: 1. Managing software: VMs provide an abstraction that can run the complete software stack, even including old operating systems like DOS. A typical deployment might be some VMs running legacy OSes, many running the current stable OS release, and a few testing the next OS release. 2. Managing hardware: One reason for multiple servers is to have each application running with the compatible version of the operating system on separate computers, as this separation can improve dependability. VMs allow these separate software stacks to run independently yet share hardware, thereby consolidating the number of servers. Another example is that some VMMs support migration of a running VM to a different computer, either to balance load or to evacuate from failing hardware. 148

149 Requirements of a Virtual Machine Monitor VMM presents a software interface to guest software. It must isolate the state of guests from each other, and it must protect itself from guest software (including guest OSes). To virtualize the processor, the VMM must control access to privileged state, address translation, I/O, exceptions and interrupts. VMM, just like paged virtual memory, must have: At least two processor modes, system and user. A privileged subset of instructions that is available only in system mode, resulting in a trap if executed in user mode. All system resources must be controllable only via these instructions. 149

150 Guest OS and VMM: guest virtual memory maps to guest real memory, which maps to physical memory. In principle, the guest OS maps virtual memory to real memory via its page tables, and the VMM page tables map the guest's real memory to physical memory. Rather than pay an extra level of indirection on every memory access, the VMM maintains a shadow page table that maps directly from the guest virtual address space to the physical address space of the hardware. 150

151 Virtualization Products VMware: the major software of the field; provides hardware-emulation virtualization products called VMware Server and ESX Server. Xen: an open source contender; provides a para-virtualization solution and comes bundled with most Linux distributions. XenSource: the commercial sponsor of Xen; provides products that are commercial extensions of Xen focused on Windows virtualization; XenSource was recently acquired by Citrix. OpenVZ: an open source product providing operating system virtualization; available for both Windows and Linux. SWsoft: the commercial sponsor of OpenVZ; provides a commercial version of OpenVZ called Virtuozzo. OpenSolaris: the open source version of Sun's Solaris operating system provides operating system virtualization. 151

152 An Example VMM: The Xen Virtual Machine Performance relative to native Linux 152

153 An Example VMM: The Xen Virtual Machine Figure: receive throughput (Mbits/sec) versus number of network interface cards. 153

154 An Example VMM: The Xen Virtual Machine 154


More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

CSC 631: High-Performance Computer Architecture

CSC 631: High-Performance Computer Architecture CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 9: Memory Part I CSC 631: High-Performance Computer Architecture 1 Introduction Programmers want unlimited amounts of memory with low

More information

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)

CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson) CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson) Memory Hierarchy We need huge amount of cheap and fast memory Memory is either fast or cheap; never both. Do as politicians do: fake it Give

More information

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy The course has 4 modules Module 1 Instruction Set Architecture (ISA) Simple Pipelining and Hazards Module 2 Superscalar Architectures

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

MEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming

MEMORY HIERARCHY DESIGN. B649 Parallel Architectures and Programming MEMORY HIERARCHY DESIGN B649 Parallel Architectures and Programming Basic Optimizations Average memory access time = Hit time + Miss rate Miss penalty Larger block size to reduce miss rate Larger caches

More information

Improving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs

Improving Cache Performance. Reducing Misses. How To Reduce Misses? 3Cs Absolute Miss Rate. 1. Reduce the miss rate, Classifying Misses: 3 Cs Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Misses Classifying Misses: 3 Cs! Compulsory The first access to a block is

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses

Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion

Improving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 11: Memory

CS252 Spring 2017 Graduate Computer Architecture. Lecture 11: Memory CS252 Spring 2017 Graduate Computer Architecture Lecture 11: Memory Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Logistics for the 15-min meeting next Tuesday Email

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design

Advanced Computer Architecture- 06CS81-Memory Hierarchy Design Advanced Computer Architecture- 06CS81-Memory Hierarchy Design AMAT and Processor Performance AMAT = Average Memory Access Time Miss-oriented Approach to Memory Access CPIExec includes ALU and Memory instructions

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

CS/ECE 3330 Computer Architecture. Chapter 5 Memory

CS/ECE 3330 Computer Architecture. Chapter 5 Memory CS/ECE 3330 Computer Architecture Chapter 5 Memory Last Chapter n Focused exclusively on processor itself n Made a lot of simplifying assumptions IF ID EX MEM WB n Reality: The Memory Wall 10 6 Relative

More information

CMSC 611: Advanced Computer Architecture. Cache and Memory

CMSC 611: Advanced Computer Architecture. Cache and Memory CMSC 611: Advanced Computer Architecture Cache and Memory Classification of Cache Misses Compulsory The first access to a block is never in the cache. Also called cold start misses or first reference misses.

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B. Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.5) Memory Technologies Dynamic Random Access Memory (DRAM) Optimized

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory. Last =me in Lecture 5

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory. Last =me in Lecture 5 CS 152 Computer Architecture and Engineering Lecture 6 - Memory Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs152!

More information

Lecture 11. Virtual Memory Review: Memory Hierarchy

Lecture 11. Virtual Memory Review: Memory Hierarchy Lecture 11 Virtual Memory Review: Memory Hierarchy 1 Administration Homework 4 -Due 12/21 HW 4 Use your favorite language to write a cache simulator. Input: address trace, cache size, block size, associativity

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Different Storage Memories Chapter 5 Large and Fast: Exploiting Memory

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COEN-4710 Computer Hardware Lecture 7 Large and Fast: Exploiting Memory Hierarchy (Chapter 5) Cristinel Ababei Marquette University Department

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information