Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST


1 Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST

2 Why cache?
- Microprocessor performance improves by about 55% per year; memory performance improves by only about 7% per year
- Principles of locality
  - Spatial locality
  - Temporal locality
- Memory stall cycles = IC × Memory references per instruction × Miss rate × Miss penalty
- Questions
  - Block placement
  - Block identification
  - Block replacement
  - Write strategy

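4 Block placement
- Direct mapped: (block address) % (no. of blocks in cache)
- Fully associative: a block can be placed anywhere in the cache
- Set associative: (block address) % (no. of sets in cache)
- A 1-way set-associative cache is a direct-mapped cache
- A cache with only 1 set is a fully associative cache
- 2-way, 4-way, or 8-way are most common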
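
The placement rules above can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the function name and the convention of numbering block frames consecutively within each set are assumptions.

```python
def placement_candidates(block_address, num_blocks, associativity):
    """Return the cache block frames where a memory block may be placed.

    Assumes frames are numbered so that set s owns frames
    s*associativity .. s*associativity + associativity - 1.
    """
    num_sets = num_blocks // associativity
    set_index = block_address % num_sets          # (block address) % (no. of sets)
    first_frame = set_index * associativity
    return list(range(first_frame, first_frame + associativity))


# An 8-block cache and memory block 12:
direct_mapped = placement_candidates(12, 8, 1)      # one frame only
two_way = placement_candidates(12, 8, 2)            # any frame of one set
fully_associative = placement_candidates(12, 8, 8)  # any frame at all
```

With associativity 1 the formula reduces to (block address) % (no. of blocks), and with a single set every frame is a candidate, which matches the two limiting cases on the slide.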

6 Block identification
- A valid bit marks whether an entry holds valid data
- address = block address : block offset
- block address = tag : index
  - index = set id in set-associative caches
  - index = block id in direct-mapped caches
  - index = none in fully associative caches
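
As a sketch of the address split above, the following helper extracts tag, index, and block offset. It assumes the block size and the number of sets are powers of two; the function name is hypothetical.

```python
def split_address(address, block_size, num_sets):
    """Split an address into (tag, index, block_offset).

    Assumes block_size and num_sets are powers of two.
    """
    offset_bits = block_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    block_offset = address & (block_size - 1)
    block_address = address >> offset_bits       # address = block address : block offset
    index = block_address & (num_sets - 1)       # block address = tag : index
    tag = block_address >> index_bits
    return tag, index, block_offset


# 16-byte blocks and 64 sets
tag, index, offset = split_address(0x1234, 16, 64)

# A fully associative cache has one set, so there is no index field
fa_tag, fa_index, fa_offset = split_address(0x1234, 16, 1)
```

In the fully associative case the whole block address becomes the tag and the index is always 0.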

8 Block replacement
- Random
  - Should be reproducible for debugging: use a pseudo-random number generator
- Least Recently Used (LRU)
  - Replace the block that has been unused for the longest time
- FIFO (First-In, First-Out)
  - Worse than random or LRU
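
A minimal sketch of LRU replacement within one cache set, assuming the set is modeled in software by an ordered dictionary. Real hardware tracks recency with a few status bits per set; the class name is hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One set of a set-associative cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = None
        return False


lru_set = LRUSet(ways=2)
results = [lru_set.access(tag) for tag in [1, 2, 1, 3, 2]]
```

In this trace the access to tag 3 evicts tag 2, not tag 1, because tag 1 was touched more recently.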

11 Write strategy
- MIPS programs: 9% stores, 25% loads
  - Writes make up about 7% of the overall memory traffic, including instruction fetches
  - Writes make up about 25% of data cache traffic
- Writes are slower than reads
  - Modifying a block cannot begin until the tag is checked
  - Variable write size (1, 2, 4, or 8 bytes): only a portion of a block can be changed

12 Write policy
- A write takes two steps: tag compare → data write
- Write through (copy through)
  - Write to both cache and memory
  - Easier to implement, but slower
- Write back (copy back)
  - A dirty bit reduces the frequency of writing back blocks on replacement
  - Writes occur at the speed of the cache memory
  - Cache and memory can be inconsistent
- Write buffer
- Write with a cache miss
  - Write allocate: the block is loaded on a write miss; used with the write-back policy
  - No-write allocate: the block is not loaded into the cache; used with the write-through policy

13 Miss rate is misleading
- Average memory access time = Hit time + Miss rate × Miss penalty
  - Look through: search the cache first, and then memory if it is a miss
  - cf. look aside: search cache and memory in parallel
- CPU time = Clock cycle time × (CPU execution cycles + Memory stall cycles)
- Memory stall cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty ≈ Memory accesses × Miss rate × Miss penalty
- CPU time = IC × Clock cycle time × (CPI + Memory accesses per inst. × Miss rate × Miss penalty)
  - Memory accesses per inst. × Miss rate = misses per inst.
  - Misses per inst. × Miss penalty = memory stalls per inst.
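
The formulas above translate directly into code. The sketch below is only an illustration: the function and parameter names are hypothetical, and the input numbers are made-up values, not measurements from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def cpu_time(ic, cycle_time, cpi_execution, accesses_per_inst, miss_rate, miss_penalty):
    """CPU time = IC x cycle time x (CPI + memory stalls per instruction)."""
    stalls_per_inst = accesses_per_inst * miss_rate * miss_penalty
    return ic * cycle_time * (cpi_execution + stalls_per_inst)


# Illustrative inputs: 1-cycle hit, 5% miss rate, 20-cycle miss penalty
average_cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)

# 1e6 instructions, 1 ns clock, CPI 1.0, 1.5 accesses/inst, 2% misses, 50-cycle penalty
seconds = cpu_time(1e6, 1e-9, 1.0, 1.5, 0.02, 50)
```

In the second example the memory stalls (1.5 cycles per instruction) exceed the base CPI of 1.0, which is why miss rate alone says little about performance.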

14 Average memory access time = Hit time + Miss rate × Miss penalty
- Cache miss categories
  - Compulsory: first-reference misses, i.e., cold-start misses
    - Remedy: make the block size larger, which can increase the other kinds of misses
  - Capacity
    - Remedy: enlarge the cache
  - Conflict: collision misses, interference misses
    - Remedy: make the cache fully associative, at the cost of a longer cycle time and an expensive implementation

17 Reducing miss rate
- Larger block size
  - Reduces compulsory misses
  - Increases miss penalty
  - May fetch data or instructions that are never used
- Higher associativity
  - A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
  - Increases cache hit time
- Victim cache
  - Contains blocks discarded from the cache by replacement
  - Checked on a miss
  - 1 to 5 entries are effective in reducing conflict misses

20 Pseudo-associativity and prefetching
- Pseudo-associativity
  - Organized as a direct-mapped cache
  - On a miss, another cache entry is checked, found by inverting the MSB of the index
- Prefetching
  - Instruction stream buffer: 1 block would catch 15% to 25% of misses; 4 blocks, 50%; 16 blocks, 72%
  - Data stream buffer: 1 block, 25%
- Compiler-controlled prefetching
  - Insert prefetch instructions (register prefetch or cache prefetch)
  - Prefetches should be non-faulting instructions
  - Makes sense in non-blocking caches
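
A small sketch of the pseudo-associative lookup order described above: the primary entry is the normal direct-mapped index, and the second entry is found by inverting the most significant bit of that index. The function name and the `index_bits` parameter are assumptions made for illustration.

```python
def pseudo_assoc_indices(index, index_bits):
    """Return (primary, alternate) cache indices for a pseudo-associative cache.

    The alternate entry is the primary index with its most significant
    index bit inverted.
    """
    alternate = index ^ (1 << (index_bits - 1))
    return index, alternate


# A cache with 16 entries (4 index bits)
primary, alternate = pseudo_assoc_indices(0b0101, 4)
```

The mapping is symmetric: the alternate of the alternate is the primary entry again, so the two entries form a pair.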

21 Compiler optimizations
- Rearrange the code to reduce the instruction miss rate
- Improve the spatial and temporal locality of data
- Examples
  - Merging arrays: improves spatial locality
  - Loop interchange: improves spatial locality
  - Loop fusion: improves temporal locality
  - Blocking: improves temporal locality
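
To illustrate why loop interchange improves spatial locality, the sketch below generates the access order of a two-dimensional array for both loop nestings and counts how often consecutive accesses fall in different cache blocks. It assumes a row-major array layout and is a toy model of the address trace, not a cache simulator; all names are hypothetical.

```python
def block_switches(rows, cols, words_per_block, column_index_inner):
    """Count how often consecutive accesses touch a different cache block.

    Assumes x[i][j] is stored row-major at word address i * cols + j.
    """
    if column_index_inner:
        # for i: for j: x[i][j]  (the interchanged, cache-friendly order)
        order = [(i, j) for i in range(rows) for j in range(cols)]
    else:
        # for j: for i: x[i][j]  (strides through memory by a whole row)
        order = [(i, j) for j in range(cols) for i in range(rows)]
    switches = 0
    previous_block = None
    for i, j in order:
        block = (i * cols + j) // words_per_block
        if block != previous_block:
            switches += 1
        previous_block = block
    return switches


strided = block_switches(4, 8, 4, column_index_inner=False)
interchanged = block_switches(4, 8, 4, column_index_inner=True)
```

After interchange, each block is entered once and all of its words are used before moving on; in the strided order every access lands in a different block than the previous one.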

25 Improving temporal locality

28 Reducing miss penalty
- Give priority to read misses over writes
  - Write buffer
    - Simple approach: the read miss waits until the write buffer is empty
    - Better: the read miss continues if there are no conflicts with the write buffer
  - Write back with dirty bit
    - Copy the dirty block to a buffer → read → write the copied block
- Sub-block placement
  - Valid bit for each sub-block
  - Sector cache
- Early restart and critical word first
  - Early restart: the CPU continues execution as soon as the requested word arrives
  - Critical word first: request the missed word first from memory

30 Nonblocking caches
- Allow cache accesses while fetching missed data/instructions
- Hit under one miss: effective for integer programs
- Hit under multiple misses: effective for FP programs

31 Second-level caches
- Average access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
- Local miss rate = Miss rate_L2
  - Large compared to Miss rate_L1, so not a good measure of L2
- Global miss rate = Miss rate_L1 × Miss rate_L2
- L2 must be much bigger than L1
- Increased block size: 64, 128, and 256 bytes are popular
- Multilevel inclusion property: L1 ⊆ L2 (all data in L1 is also in L2)
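
A short sketch of the two-level formulas above. The function names are hypothetical and the numbers are illustrative: out of 1000 memory references, 40 miss in L1 and 20 of those also miss in L2.

```python
def two_level_amat(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, miss_penalty_l2):
    """Average access time with an L2 cache behind L1."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * miss_penalty_l2)


def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all CPU references that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2


miss_rate_l1 = 40 / 1000        # L1 misses per CPU reference
local_miss_rate_l2 = 20 / 40    # L2 misses per L2 access
global_l2 = global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2)

# Assumed timings: 1-cycle L1 hit, 10-cycle L2 hit, 100-cycle L2 miss penalty
average_cycles = two_level_amat(1, miss_rate_l1, 10, local_miss_rate_l2, 100)
```

The local L2 miss rate here is 50% even though only 2% of all references go to memory, which is why the global miss rate is the more meaningful figure for L2.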

35 Reducing hit time
- Small and simple caches
- Avoiding address translation during indexing the cache
  - Physical cache
    - Address translation → cache access
    - Pipelining
  - Virtual cache
    - A process switch requires the caches to be flushed
      - Alternative: PID (process identification tag)
    - More than one virtual address can map to the same physical address: aliases or synonyms
      - Anti-aliasing
      - Page coloring: forces aliases to share some address bits
    - I/O addresses: require mapping to virtual addresses to interact with a virtual cache

37 Avoiding address translation during indexing the cache
- Virtually indexed, physically tagged cache
  - The cache is indexed with the page offset while the virtual page number is being translated
  - Limitation: a direct-mapped cache can be no bigger than the page size
  - For large caches
    - High associativity
    - Make a few bits of the virtual and physical page addresses identical
    - Small hardware mapping table
- Pipelining writes for fast write hits
  - Two stages: tag comparison → data write
  - The second stage occurs during the first stage of the next write
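
The size limitation above follows from requiring that index and block offset fit inside the page offset, so each way can be at most one page. The sketch below works through that arithmetic; the function names are hypothetical.

```python
def max_vipt_cache_size(page_size, associativity):
    """Largest cache that can be indexed using only page-offset bits."""
    return page_size * associativity


def min_associativity(cache_size, page_size):
    """Smallest associativity that keeps the index within the page offset."""
    return max(1, -(-cache_size // page_size))  # ceiling division


# With 4 KB pages, a direct-mapped cache is limited to 4 KB
direct_mapped_limit = max_vipt_cache_size(4096, 1)

# A 32 KB cache with 4 KB pages needs high associativity
ways_needed = min_associativity(32 * 1024, 4096)
```

This is the reason the slide lists high associativity as the first option for large caches.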

40 Memory
- DRAM
  - One-transistor cell with one capacitor: needs refresh
  - Used for large capacity
  - 4-fold improvement in capacity every 3 years
  - Destructive reading
  - One-bit output → multi-bit output
- SRAM
  - 4- or 6-transistor cell
  - No refresh
  - Used for speed and capacity
  - Multi-bit output

42 Wider main memory and interleaved memory
- Wider main memory
  - Wider memory bandwidth
  - Leads to low miss penalty
  - Needs a multiplexer between cache and CPU
  - Writing to a memory with error correction code is difficult: needs separate ECC for every word
- Interleaved memory
  - Multiple banks
  - Usually word interleaved
  - Permits simultaneous reading: exploits spatial locality
  - Permits separate writing/reading: successive writes proceed if not destined to the same bank
  - No. of banks ≥ No. of clocks to access a word in a bank

43 Memory banks
- Capacity per memory chip increases
  - Fewer memory chips in the memory system
  - Makes multiple banks more expensive
  - DRAM with wider paths
- Independent memory banks
  - Separate address lines and possibly separate data lines

44 Avoiding memory bank conflicts
- Interleaved banks
  - Bank no. = address % no. of banks
  - Address within bank = address / no. of banks
- A prime number of banks
  - Address within bank = address % no. of words in bank
  - Words in a bank = 2^M
  - Chinese remainder theorem, residue number system
  - No. of banks = 2^N - 1
  - There are fast hardware schemes
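
The two mappings above can be compared with a short sketch. It assumes 3 banks of 8 words each for the prime case (3 = 2^2 - 1 banks, 8 = 2^3 words), so the two moduli are coprime and the Chinese remainder theorem guarantees that every address gets a distinct (bank, address within bank) pair. The function names are hypothetical.

```python
def conventional_map(address, num_banks):
    """Interleaved banks: bank = address % banks, offset = address / banks."""
    return address % num_banks, address // num_banks


def prime_bank_map(address, num_banks, words_per_bank):
    """Prime number of banks: both fields are plain modulo operations.

    Requires num_banks and words_per_bank to be coprime.
    """
    return address % num_banks, address % words_per_bank


# Every address 0..23 lands in a distinct (bank, offset) slot
slots = {prime_bank_map(a, 3, 8) for a in range(24)}

# Stride-4 accesses: 4 conventional banks always hit bank 0,
# while 3 banks spread the same accesses over all banks
conventional_banks = {conventional_map(a, 4)[0] for a in range(0, 32, 4)}
prime_banks = {prime_bank_map(a, 3, 8)[0] for a in range(0, 24, 4)}
```

Because the word count per bank is a power of two, the address within the bank is just the low-order bits, with no division needed.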

46 DRAM-specific interleaving
- Nibble mode
- Page mode
- Static column mode
- RAMBUS
  - Packet-switched bus or split-transaction bus

48 The aim of MMU
- The MMU provides the programmer with a large virtual memory through address translation
- CPU → (virtual address) → MMU → (physical address) → Memory
- Role of MMU
  - Protection and sharing
  - Relocation
  - Logical/physical memory organization

49 Conceptual view of MMU
- [Figure: a translation table f maps the virtual address space to the physical address space, or signals a fault]
- Address translation mechanism
  - Paging
  - Segmentation
  - Paged segmentation

50 Paging and segmentation
- Paging
  - Most widely used virtual memory technique
  - Divides the virtual and physical address spaces into pages of the same size
  - Contiguous virtual addresses are scattered in the physical address space
- Segmentation
  - Divides the virtual address space into segments that directly relate to objects at the programming level
  - Segments may even vary in size during process execution
  - Protection and sharing are possible at the object level

51 Page allocation of a process
- [Figure: before allocation, pages 0 to 3 of process A sit beside a free frame list while some physical memory frames are in use; after allocation, pages 0 to 3 occupy free frames scattered among the in-use frames of physical memory]

52 Address translation using paging
- Virtual address = Vp : Disp
- PTR points to the page table; the entry PT(Vp) holds Rp
- Physical (real) address = Rp : Disp
- Vp: virtual page number
- Rp: physical (real) page number
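
A minimal sketch of the translation above, assuming the page table is modeled as a dictionary from Vp to Rp and that a missing entry means a page fault. The names `translate_page` and `PageFault` are hypothetical.

```python
class PageFault(Exception):
    """Raised when the virtual page has no entry in the page table."""


def translate_page(virtual_address, page_table, page_size):
    """Translate Vp : Disp into Rp : Disp using PT(Vp)."""
    vp, disp = divmod(virtual_address, page_size)
    if vp not in page_table:
        raise PageFault(vp)
    rp = page_table[vp]
    return rp * page_size + disp


# Illustrative table: virtual page 0 -> frame 5, virtual page 1 -> frame 2
page_table = {0: 5, 1: 2}
physical = translate_page(4096 + 12, page_table, 4096)
```

The displacement passes through unchanged; only the page number is replaced.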

53 Segmentation allocation
- [Figure: a logical address space with five segments (subroutine, function, stack, symbol table, main program), numbered segment 0 to segment 4, is mapped through a segment table of limit/base pairs into physical memory, where the segments appear in the order 0, 3, 2, 4, 1]

54 Address translation using segmentation
- Virtual address = Vs : Disp
- STR points to the segment table; the entry ST(Vs) holds Rs
- Physical (real) address = Rs + Disp
- Vs: virtual segment ID
- Rs: physical (real) base address of a virtual segment
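
A sketch of segment translation under the assumption that each segment table entry holds a (base, limit) pair, as in the limit/base table of the segmentation allocation slide. The bounds check and all names are illustrative additions, and the table values are made up.

```python
class SegmentViolation(Exception):
    """Raised when the displacement falls outside the segment."""


def translate_segment(vs, disp, segment_table):
    """Translate (Vs, Disp) into a physical address Rs + Disp."""
    base, limit = segment_table[vs]   # ST(Vs)
    if not 0 <= disp < limit:
        raise SegmentViolation((vs, disp))
    return base + disp


# Illustrative table: segment 2 starts at 4300 and is 400 bytes long
segment_table = {2: (4300, 400)}
physical = translate_segment(2, 53, segment_table)
```

Unlike paging, the base is added to the displacement, because segments can start at any address and have any length.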

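55 Address translation (paged segmentation)
- Virtual address = seg. number : page number : Disp.
- STP points to the segment table; the entry ST(Vs) locates the page table of the segment
- The page table entry PT(Vp) holds Rp
- Real address = Rp : Disp.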

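56 The role of TLB
- Caches the recent address mappings
- Accelerates address translation by hardware assistance
- TLB for paged segmentation
  - [Figure: the access type (RWX) and the Vs and Vp fields of the virtual address are matched against TLB entries (RWX, Vs, Vp, Rp); the selected Rp is combined with Disp to form the physical address]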

57 Inverted page table
- A modern large virtual address space requires a large page table or multilevel paging
- Inverted page table
  - Reduces the page table size
  - One entry for each physical memory page, instead of each virtual page
  - Uses associative search or hashing functions
- [Figure: Vp is hashed to find a chain of linked entries (Vp, Rp); the matching entry supplies Rp, which is combined with Disp]
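
A minimal sketch of an inverted page table that uses hashing with chained entries. The hash function (Vp modulo the number of frames), the anchor table, and all names are assumptions for illustration; the slide does not specify these details.

```python
class InvertedPageTable:
    """One entry per physical frame, searched through a hash chain."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.vp = [None] * num_frames      # virtual page held by each frame
        self.link = [None] * num_frames    # next frame in the same hash chain
        self.anchor = [None] * num_frames  # hash bucket -> first frame in chain

    def _hash(self, vp):
        return vp % self.num_frames        # assumed hash function

    def map(self, vp, rp):
        """Record that virtual page vp lives in frame rp (rp must be free)."""
        bucket = self._hash(vp)
        self.vp[rp] = vp
        self.link[rp] = self.anchor[bucket]
        self.anchor[bucket] = rp

    def lookup(self, vp):
        """Return the frame holding vp, or None if it is not resident."""
        rp = self.anchor[self._hash(vp)]
        while rp is not None and self.vp[rp] != vp:
            rp = self.link[rp]
        return rp


ipt = InvertedPageTable(num_frames=4)
ipt.map(vp=9, rp=0)
ipt.map(vp=5, rp=3)   # 9 and 5 hash to the same bucket and share a chain
frame = ipt.lookup(9)
```

The table size depends on the number of physical frames only, regardless of how large the virtual address space is.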

58 Replacement policies for fixed-size partitions
- Optimal (MIN) replacement
  - Ideal; used as a theoretical policy
- Random replacement
- FIFO replacement
- FINUFO (first-in not-used first-out)
  - Each page has a used flag
  - Recentness problem
- LRU replacement
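
FINUFO can be sketched as FIFO with a second chance: a pointer sweeps the frames in order, skipping (and clearing) frames whose used flag is set. This is one common formulation, often called the clock algorithm; setting the flag when a page is first loaded is an assumption of this sketch, and the class name is hypothetical.

```python
class FinufoReplacer:
    """First-in not-used first-out page replacement with used flags."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames
        self.used = [False] * num_frames
        self.hand = 0  # next frame to consider for replacement

    def access(self, page):
        """Return True on a hit, False on a page fault."""
        if page in self.pages:
            self.used[self.pages.index(page)] = True
            return True
        # Skip frames that were used since the last sweep, clearing their flags
        while self.used[self.hand]:
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page
        self.used[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return False


replacer = FinufoReplacer(num_frames=3)
hits = [replacer.access(p) for p in [1, 2, 3, 1, 4, 2, 5]]
```

In this trace page 5 replaces page 3 rather than page 2, because page 2 was referenced after the last sweep cleared its flag.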

59 Two reasons for virtual memory
- Sharing
  - Divides physical memory into blocks and allocates them to different processes
  - Multi-programming: many processes use only a small part of their address space, so they can share physical memory
- Relocation
  - Reduces the programmer's job of fitting the program into memory
- Protection
- Demand paging
  - Main memory is analogous to a cache

62 Virtual memory design choices
- Placement: fully associative
  - Because of the large miss penalty
- Replacement: LRU
  - Provide a use (reference) bit
  - Periodically clear the bit, and set it whenever a page is accessed
- Write policy: write back
  - The secondary storage (magnetic disk) is so slow that the write-through policy cannot be used
- Address translation
  - TLB (translation lookaside buffer)
  - Small, fully associative cache
  - Sometimes pipelined
- Page size
  - Large size is preferred

65 Page size
- Large size is preferred
- Pros
  - Page table size is small
  - Fast cache hit time (small tag)
  - Transferring larger pages to/from secondary storage is more efficient than transferring small pages
  - More memory can be mapped by a TLB, which has a fixed number of entries
- Cons
  - More internal fragmentation
  - Longer start-up time
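
Two of the advantages above are simple arithmetic. The sketch below assumes a flat, single-level page table with one fixed-size entry per virtual page; the 4-byte entry size, the 64-entry TLB, and the function names are illustrative assumptions.

```python
def page_table_bytes(virtual_address_bits, page_size, entry_bytes):
    """Size of a flat page table with one entry per virtual page."""
    num_pages = 2 ** virtual_address_bits // page_size
    return num_pages * entry_bytes


def tlb_reach(tlb_entries, page_size):
    """Amount of memory that a fully loaded TLB can map."""
    return tlb_entries * page_size


# 32-bit virtual addresses, 4-byte entries
small_pages = page_table_bytes(32, 4 * 1024, 4)    # 4 KB pages
large_pages = page_table_bytes(32, 64 * 1024, 4)   # 64 KB pages

# A 64-entry TLB
reach_small = tlb_reach(64, 4 * 1024)
reach_large = tlb_reach(64, 64 * 1024)
```

Going from 4 KB to 64 KB pages shrinks the table by a factor of 16 and extends the TLB reach by the same factor.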

67 Paging vs. segmentation
- Paging
  - Page: fixed-size block, usually 4 KB to 64 KB
  - Virtual addr. = page # : page offset
  - Internal fragmentation
  - Managed by page tables
    - Indexed by virtual page number
    - Contains physical page number
- Segmentation
  - Segment: variable-size block, 1 B to 2^32 B
  - Virtual addr. = segment base + segment offset
  - External fragmentation: needs garbage collection
  - Managed by segment table
    - Contains physical segment base addresses and bounds
- Paged segmentation

68 Protection
- Address range protection
  - Base < address < bound, or (base + address) < bound
- Access permission bits for each page or segment
  - Write protection for code space
- User/supervisor mode
  - Priority rings
- CPU states that a user process cannot modify
  - User/supervisor bit
  - Base/bound register
  - Exception enable/disable bit
- Mechanism to change modes
  - System call

69 Cache vs. virtual memory
- Replacement
  - Cache: controlled by hardware
  - VM: controlled by OS, because of the longer miss penalty
- Size
  - Cache: independent of address size
  - VM: processor address size determines the size of VM
- Secondary storage
  - Cache: main memory (MM)
  - VM: file system that is not part of the address space

70 Issues in memory hierarchy design
- Multiple issue and the number of ports to the cache
  - Insufficient peak bandwidth from the cache to match the peak demands of the instructions
  - Multiple ports lengthen the critical path
- Speculative execution and the memory system
  - Possibility of generating invalid addresses
  - The memory system must identify exceptions caused by speculatively executed instructions and conditional instructions, and suppress them
  - Non-blocking cache
- ILP vs. reducing cache misses

72 I/O and consistency of cached data
- Cache-coherency problem
- I/O between the I/O device and the cache
  - Works for both write back and write through
  - Severe interference with the CPU
- I/O between the I/O device and main memory
  - Main memory acts as an I/O buffer
  - Mark the buffer as a non-cacheable block
  - Or flush the buffer addresses from the cache after input occurs
  - Sometimes a duplicate set of tags is kept to avoid slowing down the cache to check addresses
