Pollard's Attempt to Explain Cache

Start with a (Very) Basic Block Diagram
CPU (actual work done here)  <->  Memory (starting and ending data stored here, along with the program)
Organization of memory: designer's choice
Problem with the System: Mismatched Speeds
CPU speed: as fast as technology and the application allow
Big memory: must be slow to be economical

One Solution: Small(er), Fast(er) Memory for Active Stuff
CPU  <->  Cache  <->  Main Memory
The cache is designed so interaction with the CPU happens at CPU speed
Cache-to-main-store interaction is designed for larger data transfers
Step 1: Choose Sizes of Memories
Cache: 256 KBytes
Main memory: 512 MBytes

Step 2: Organize Memory in Lines (Blocks)
Cache: 256 KBytes, or 8192 lines with 32 bytes/line
Main memory: 512 MBytes, or 16 M lines with 32 bytes/line
Observation 1: 29 Bits Needed for the Main Store Address
[Bit-field diagram: main store address, bits 28..0]

Observation 2: Each Memory (Cache and Main Store) Is Made Up of 32-Byte Lines
Cache: 8192 lines (only 128 shown in the original diagram)
Observation 3: The Address Within a Line Takes 5 Bits
[Bit-field diagram: bits 28..5 select the line; bits 4..0 give the address within the line]

Step 3: Organize Lines into Groups or Sets of Four
Cache: 8192 lines organized in 2048 sets (numbered 0..2047)
Observation 4: 2048 Sets Require 11 Bits of Address
[Bit-field diagram: bits 15..5 are the set identifier bits; bits 4..0 the address within the line]

Observation 5: The Remaining Address Bits Form (Part of) the Tag
[Bit-field diagram: bits 28..16 are the tag address bits (13 bits); bits 15..5 the set identifier; bits 4..0 the address within the line]
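A minimal C sketch of this address split, using the numbers above (32-byte lines, 2048 sets, a 29-bit main store address); the function names are illustrative, not from the lecture.

    #include <stdint.h>
    #include <stdio.h>

    /* Address layout from Observations 3-5:
     *   bits  4..0   offset within the 32-byte line  (5 bits)
     *   bits 15..5   set number, 2048 sets           (11 bits)
     *   bits 28..16  tag                             (13 bits)
     */
    #define OFFSET_BITS 5
    #define SET_BITS    11

    static uint32_t line_offset(uint32_t addr) { return addr & 0x1F; }
    static uint32_t set_number(uint32_t addr)  { return (addr >> OFFSET_BITS) & 0x7FF; }
    static uint32_t tag_of(uint32_t addr)      { return addr >> (OFFSET_BITS + SET_BITS); }

    int main(void) {
        uint32_t addr = 0x12345678 & 0x1FFFFFFF;   /* keep the 29 address bits */
        printf("addr 0x%08X -> tag 0x%X, set %u, offset %u\n",
               (unsigned)addr, (unsigned)tag_of(addr),
               (unsigned)set_number(addr), (unsigned)line_offset(addr));
        return 0;
    }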
Step 4: Add a Tag Section to the Cache
[Diagram: tag and data columns, one row per set, set numbers 0..2047]

Read Action, Step 1: CPU Requests a Read at an Address
CPU generates the address; the address is separated into tag, set number, and offset within the line
(CPU -> Cache -> Main Memory)
Read Action, Step 2: Cache Checks for the Line in the Set
The cache compares the tag bits against the four lines in the set: any match results in a cache hit

Read Action, Step 3: On a Cache Hit, Data Is Sent Immediately to the CPU
On a cache hit the CPU continues activity immediately; no pause in CPU activity
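A C sketch of Step 2's tag comparison, assuming one valid bit per line (the valid bit appears in the keyword list later); note that real hardware compares all four tags in parallel, not in a loop.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4
    #define SETS 2048
    #define LINE_BYTES 32

    /* One cache line: valid bit, stored tag, and the 32 data bytes. */
    struct line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    };

    static struct line cache[SETS][WAYS];

    /* Step 2: compare the tag against the four lines in the selected set.
     * Returns the matching way on a cache hit, or -1 on a cache miss. */
    static int lookup(uint32_t set, uint32_t tag) {
        for (int way = 0; way < WAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return way;   /* hit: data goes to the CPU immediately */
        return -1;            /* miss: controller must fetch the line  */
    }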
Read Action, Step 4: On a Cache Miss, the Controller Requests the Line from Main Store
Cache miss: the cache controller moves the line from main store to the cache (four transfers if the bus width = 8 bytes)

Additional Cache Issues
Write protocol: how to handle CPU writes (a sketch of both policies follows below)
  Write back: writes are handled in the cache
  Write thru: writes also modify main store
Set size: how many lines per set
Line-in-set selection algorithms vs. reality
Write-back sequence of events
Cache concept and I/O requirements
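A sketch of the two write protocols just listed, assuming a dirty bit per line (as in the keyword list) and a stand-in main_store_write() for the bus transfer; both names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    enum policy { WRITE_BACK, WRITE_THRU };

    struct wline {
        bool    dirty;            /* set when the cached copy diverges */
        uint8_t data[32];
    };

    /* Stand-in for a bus transfer to main store. */
    static void main_store_write(const struct wline *ln, uint32_t offset) {
        (void)ln; (void)offset;   /* real hardware would drive the bus here */
    }

    static void cpu_write(struct wline *ln, uint32_t offset, uint8_t value,
                          enum policy p) {
        ln->data[offset] = value;             /* update the cached copy    */
        if (p == WRITE_THRU)
            main_store_write(ln, offset);     /* also modify main store    */
        else
            ln->dirty = true;                 /* write back: flush only    */
                                              /* when the line is evicted  */
    }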
Higher CPU Speeds Lead to Multi-Level Caching
CPU -> Level 1 Cache -> Level 2 Cache -> Main Memory

Keywords from Text
Cache; Virtual memory; Direct mapped; Set associative; Fully associative; Valid bit; Block address; Write thru; Instruction cache; Average access time; Cache hit; Page; Miss penalty; Dirty bit; Block offset; Write back; Data cache; Hit time
Keywords from Text (continued)
Cache miss; Page fault; Miss rate; Least recently used; Tag field; Write allocate; Unified cache; Misses per instruction; Block/line; Locality; Address trace; Set; Random replacement; Index field; No-write allocate; Write buffer; Stall

Four Memory Hierarchy Questions
Q1: Where can a block be placed?
Q2: How is a block found?
Q3: Which block is replaced on a miss?
Q4: What happens on a write?
Cache Performance
The book's version of the equation:
Average memory access time = Hit time + Miss rate × Miss penalty

Reducing Cache Miss Penalty
Technique 1: Multilevel caches (worked example below)
Technique 2: Critical word first, early restart
Technique 3: Giving priority to read misses over writes
Technique 4: Merging write buffer
Technique 5: Victim caches
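The book's equation applied once per level gives the multilevel numbers of Technique 1: the L1 miss penalty is itself the L2 average access time. The hit times, miss rates, and main-memory penalty below are made-up illustration values, not figures from the lecture.

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty. */
    int main(void) {
        double l1_hit = 1.0,  l1_miss_rate = 0.05;   /* illustrative values   */
        double l2_hit = 10.0, l2_miss_rate = 0.20;
        double mem_penalty = 100.0;                  /* cycles to main memory */

        double l2_amat = l2_hit + l2_miss_rate * mem_penalty;
        double l1_amat = l1_hit + l1_miss_rate * l2_amat;

        printf("L2 AMAT = %.1f cycles\n", l2_amat);  /* 10 + 0.20*100 = 30.0 */
        printf("L1 AMAT = %.1f cycles\n", l1_amat);  /* 1 + 0.05*30 = 2.5    */
        return 0;
    }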
Reducing Miss Rate
Miss types:
  Compulsory (cold-start, first-reference misses)
  Capacity (working set won't fit)
  Conflict (collision misses, interference misses)
Miss rates depend on a variety of factors

Reducing Miss Rate
Technique 1: Larger block size
Technique 2: Larger caches
Technique 3: Higher associativity
Technique 4: Way prediction, pseudoassociative caches
Technique 5: Compiler optimizations (sketched below)
  Loop interchange
  Blocking
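Minimal C sketches of the two compiler optimizations named in Technique 5; the array sizes, block size B, and function names are illustrative.

    #define N 1024

    /* Loop interchange: C stores x[][] row-major, so putting j in the
     * inner loop walks memory with stride 1 and reuses each cache line
     * (before: for (j) { for (i) ... } strode N doubles per access). */
    void interchange(double x[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2.0 * x[i][j];
    }

    /* Blocking (tiling): operate on B x B submatrices so the working
     * set fits in the cache instead of streaming whole rows/columns. */
    #define B 32
    void blocked_transpose(double a[N][N], double t[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        t[j][i] = a[i][j];
    }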
Miss-Rate Help via Parallelism
Nonblocking caches reduce stalls on cache misses
Hardware prefetching of instructions and data
Compiler-controlled prefetching (sketched below)
  Register prefetch / cache prefetch
  Faulting / non-faulting

Reducing Hit Time
Small and simple caches
Avoiding address translation during indexing of the cache
Pipelined cache access
Trace caches
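For the compiler-controlled prefetching in the parallelism list above: GCC and Clang expose __builtin_prefetch, which issues a non-faulting cache prefetch (a bad address is simply ignored). The loop and the prefetch distance of 16 are illustrative.

    /* Request data a few iterations ahead so it arrives in the cache
     * before the demand load; arguments are (address, 0 = read,
     * 1 = low temporal locality). */
    void scale(double *v, long n, double k) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&v[i + 16], 0, 1);
            v[i] *= k;
        }
    }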
Main Memory Organizations
Wider main memory
Simple interleaved memory (sketched below)
Independent memory banks

Memory Technology
SRAM, DRAM, ROM, PROM, EPROM, Flash, SSRAM, DDR DRAM
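A sketch of the simple interleaving just listed: with low-order interleaving, consecutive 32-byte lines fall in successive banks, so a multi-line transfer keeps all banks busy at once. The four-bank count is an assumption for illustration.

    #include <stdint.h>

    #define BANKS 4    /* assumed bank count */

    /* Which bank holds a line, and where inside that bank it sits. */
    static uint32_t bank_of(uint32_t addr)          { return (addr >> 5) % BANKS; }
    static uint32_t addr_within_bank(uint32_t addr) { return (addr >> 5) / BANKS; }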
Virtual Memory
Vocabulary of virtual memory:
  Page / segment
  Page size / segment size
  Dirty page/segment
  Page replacement
  Address mapping (sketched below)
  TLB
  Use bit / reference bit
  Fragmentation

Questions, Revisited
Where can a page be placed in memory?
How is a block found if it is in main memory?
Which block should be replaced on a virtual memory miss?
What happens on a write?
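A sketch of the address mapping from the vocabulary above, assuming 4 KB pages, a 32-bit virtual address, and a flat page table with the TLB omitted; all names are illustrative.

    #include <stdint.h>

    #define PAGE_BITS 12                    /* assume 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* One frame number per 4 KB page of a 32-bit virtual space. */
    static uint32_t page_table[1u << 20];

    /* Virtual address = virtual page number + offset; the mapping
     * replaces the page number with a frame number and leaves the
     * offset within the page unchanged. */
    static uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;        /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);
        return (page_table[vpn] << PAGE_BITS) | offset;
    }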
Page Size Selection
Larger pages:
  Page table size (bigger pages mean less space required for the page table; worked numbers below)
  Larger page size is conducive to larger/faster caches
  Transferring larger pages is more efficient
  The TLB is a fixed size, so larger pages map more address space at any time
Smaller pages: conserve storage

Fallacies and Pitfalls
Fallacy: Predicting cache performance of one program from another
Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy
Pitfall: Too small an address space
Pitfall: Emphasizing memory bandwidth in DRAMs versus memory latency
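The page-table-size point above, worked numerically: the 32-bit address space and 4-byte page table entries are assumptions for illustration.

    #include <stdio.h>

    /* Larger pages -> fewer pages -> smaller page table. */
    int main(void) {
        unsigned address_bits = 32, entry_bytes = 4;
        unsigned page_bits[] = { 12, 14, 16 };       /* 4 KB, 16 KB, 64 KB */
        for (int i = 0; i < 3; i++) {
            unsigned long long entries = 1ULL << (address_bits - page_bits[i]);
            printf("%2u KB pages -> %llu entries -> %llu KB table\n",
                   (1u << page_bits[i]) / 1024, entries,
                   entries * entry_bytes / 1024);    /* 4096, 1024, 256 KB */
        }
        return 0;
    }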
Fallacies and Pitfalls (continued)
Pitfall: Delivering high memory bandwidth in a cache-based system
Pitfall: Ignoring the impact of the operating system on the performance of the memory hierarchy
Pitfall: Relying on the operating system to change the page size over time