ECE4750/CS4420 Computer Architecture
L6: Advanced Memory Hierarchy
Edward Suh, Computer Systems Laboratory
suh@csl.cornell.edu

Announcements
- Lab 1 due today
- Reading: Chapter 5.1-5.3
Overview
- How to improve cache performance
- Recent research: Flash cache

Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty
To improve performance, reduce one or more of the three terms:
- Decrease hit time
- Decrease miss rate
- Decrease miss penalty
Small, Simple Caches
- On many machines today the cache access sets the cycle time
- Hit time is therefore important beyond its effect on AMAT
[Figure: direct-mapped lookup (tag/index/byte offset, single tag comparator) vs. set-associative lookup (multiple comparators and a data mux)]

Way-Predicting Caches (MIPS R10000 L2)
- Use the processor address to index into a way-prediction table
- Look in the predicted way at the given index, then:
  - hit in the predicted way: the access completes at direct-mapped speed
  - miss in the predicted way: check the other way(s) in a following cycle and update the prediction
Improving Cache Performance
- Decrease hit time
- Decrease miss rate
- Decrease miss penalty

Causes for Cache Misses (the "3 C's")
- Compulsory: first reference to a block, a.k.a. cold-start misses
  - misses that would occur even with an infinite cache
- Capacity: cache is too small to hold all data needed by the program
  - misses that would occur even under a perfect replacement policy
- Conflict: misses that occur because of collisions due to the block-placement strategy
  - misses that would not occur with full associativity
Effect of Cache Parameters
- Larger cache size
- Higher associativity
- Larger block size

Victim Cache (HP 7200)
[Figure: CPU/RF -> L1 data cache, with a small victim cache alongside, backed by a unified L2 cache]
- A victim cache is a small associative back-up cache, added to a direct-mapped cache, which holds recently evicted blocks
Prefetching
- Speculate on future instruction and data accesses and fetch them into the cache(s)
- Varieties of prefetching
  - Hardware prefetching
  - Software prefetching
  - Mixed schemes
- What types of misses does prefetching affect?

Issues in Prefetching
- Usefulness
- Timeliness
- Cache and bandwidth pollution
[Figure: CPU/RF with L1 instruction and L1 data caches; prefetched data flows in from a unified L2 cache]
Hardware Instruction Prefetch (Alpha 21064)
[Figure: CPU requests a block from the L1 instruction cache; the requested block comes from the unified L2 cache, and the prefetched next block is placed in a stream buffer]

Hardware Data Prefetching
- Prefetch-on-miss: One Block Lookahead (OBL) scheme
- Strided prefetch
Software Prefetching
- Compiler-directed prefetching: the compiler can analyze code and know where misses occur

    for (i = 0; i < N; i++) {
        prefetch(&a[i + 1]);
        prefetch(&b[i + 1]);
        SUM = SUM + a[i] * b[i];
    }

- Issues? What property do we require of the cache for prefetching to work?

Compiler Optimizations
- Restructuring code affects the data block access sequence
  - Group data accesses together to improve spatial locality
  - Re-order data accesses to improve temporal locality
- Prevent data from entering the cache
  - Useful for variables that will only be accessed once before being replaced
  - Needs a mechanism for software to tell hardware not to cache data (instruction hints or page-table bits)
- Kill data that will never be used again
  - Streaming data exploits spatial locality but not temporal locality
  - Replace into dead cache locations
Array Merging
- Some weak programmers may produce code like:

    int val[size];
    int key[size];

  and proceed to reference val and key in lockstep
- Problem?

Loop Interchange

    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

- What type of locality does this improve?
Loop Fusion

    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            a[i][j] = b[i][j] * c[i][j];
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            d[i][j] = a[i][j] * c[i][j];

- What type of locality does this improve?

Blocking

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

[Figure: access patterns of x, y, and z, marking untouched, old-access, and new-access elements]
Blocking

    for (jj = 0; jj < N; jj = jj + b)
        for (kk = 0; kk < N; kk = kk + b)
            for (i = 0; i < N; i++)
                for (j = jj; j < min(jj + b, N); j++) {
                    r = 0;
                    for (k = kk; k < min(kk + b, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

[Figure: blocked access patterns of x, y, and z]

Improving Cache Performance
- Decrease hit time
- Decrease miss rate
- Decrease miss penalty
- Some cache misses are inevitable; when they do happen, we want to service them as quickly as possible
Multilevel Caches
- A memory cannot be both large and fast
- Increase the size of the cache at each level
[Figure: CPU -> L1 -> L2 -> DRAM]

AMAT = Hit time_L1 + Miss rate_L1 x Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 x Miss penalty_L2

What is the 2nd-level miss rate?
- Local miss rate: number of cache misses / accesses to that cache
- Global miss rate: number of cache misses / CPU memory references

L2 Cache Hit Time vs. Miss Rate
Reduce Read Miss Penalty
[Figure: CPU/RF -> data cache, with a write buffer between the data cache and the unified L2 cache]
- Let read misses bypass writes
- Problem? Solution?

Early Restart
- Decrease miss penalty with no new hardware
  - well, okay, with some more complicated control
- Strategy: impatience! There is no need to wait for the entire line to be fetched
- Early restart: as soon as the requested word (or double word) of the cache block arrives, let the CPU continue execution
  - If the CPU references another cache line, or a later word in the same line: stall
- Early restart is often combined with the next technique
Critical Word First
- Improvement over early restart
  - request the missed word first from the memory system
  - send it to the CPU as soon as it arrives
  - the CPU consumes the word while the rest of the line arrives
- Example: 32B block (8 words at offsets 0, 4, 8, 12, 16, 20, 24, 28), miss on address 20
  - words return from the memory system in wrap-around order: 20, 24, 28, 0, 4, 8, 12, 16

Sub-blocking
- Tags are too large, i.e., too much overhead
  - Simple solution: larger blocks, but the miss penalty could be large
- Sub-block placement (a.k.a. sector cache)
  - A valid bit is added to units smaller than the full block, called sub-blocks
  - Only read a sub-block on a miss
- If a tag matches, is the word in the cache?

    Tag   Sub-block valid bits
    100   1 1 1 1
    300   1 1 0 0
    204   0 1 0 1
Recent Research: Flash Cache
Slides from Taeho Kgil and Trevor Mudge (U of Michigan)

Roadmap for Memory - Flash (from the ITRS 2005 Roadmap)

    Cell density (MBytes/cm^2)      2005       2007       2009        2011         2013
    SRAM                              11         17         28          46           74
    DRAM                             153        243        458         728        1,154
    NAND Flash (SLC / MLC)       339/713  604/1,155  864/1,696  1,343/5,204  2,163/8,653

Power/Performance of DRAM and Flash
(performance and power for 1Gb memory, from Samsung datasheets, 2003)

          Density     $/Gb  Active power  Idle power  Read latency  Write latency  Erase latency
          (Gb/cm^2)
    DRAM     0.7        48       878 mW       80 mW         55 ns          55 ns            N/A
    NAND    1.42        21        27 mW        6 μW         25 μs         200 μs         1.5 ms

- Flash is good for idle-power optimization: ~1000x less power than DRAM
- Flash is not so good for a low-access-latency usage model
  - DRAM is still required for decent access latencies
Cost of Power and Cooling
[Figure: worldwide cost of purchasing and operating servers; source: IDC]

System-Level Power Consumption
SunFire T2000 power running SPECjbb (total power 271W):
- Processor: 26%
- 16GB memory: 22%
- I/O: 22%
- AC/DC conversion: 15%
- Fans: 10%
- Disk: 4%
- (remainder: service processor)
From a Sun talk given by James Laudon
Case for Flash as a 2nd Disk Cache
- Many server workloads use a large working set (100s of MBs to 10s of GBs, and even more)
  - The large working set is cached in main memory to maintain high throughput
  - A large portion of DRAM goes to the disk cache
- Many server applications are more read-intensive than write-intensive
- 32GB of DRAM on a SunFire T2000 consumes 45W of idle power
  - Flash memory consumes ~1000x less idle power than DRAM
- Use DRAM for recent and frequently accessed content, and use Flash for less recent and infrequently accessed content
- Client requests follow a Zipf-like distribution, spatially and temporally
  - Ex: 90% of client requests are to 20% of the files

Latency vs. Throughput
[Figure: SPECweb99 network bandwidth in Mbps (throughput) for MP4/MP8/MP12 configurations, up to ~1,200 Mbps, vs. disk-cache access latency to 80% of files, from 12us to 1600us]
Overall Architecture
[Figure: baseline drive without FlashCache — processors, 1GB DRAM main memory, HDD controller, hard disk]

Flash Lifetime
[Figure: lifetime in years (log scale, 0.01 to 10,000) vs. Flash memory size as a percentage of working-set size (0-100%), for the SURGE, SPECweb99, Financial1, and WebSearch1 workloads]
Programmable Flash Controller
[Figure: programmable Flash memory controller — GF field LUT, BCH configuration descriptor, density descriptor, and P/E descriptor feeding BCH encode/decode; CRC encode/decode on the external interface; Flash density control and program/erase time control on the NAND Flash memory; signals include Flash address, Flash data (reads/writes), and bit error (yes/no)]

Impact of ECC
[Figure: maximum tolerable P/E cycles (100,000 to ~7,100,000) vs. number of correctable errors (code strength, 0-12), for P/E-cycle standard deviations of 0, 5%, 10%, and 20% of the mean]
Flash Lifetime w/ Programmable Controller
[Figure: lifetime in years vs. Flash memory size (percentage of working-set size) for SURGE, SPECweb99, Financial1, and WebSearch1, with the programmable controller]

Overall Performance
[Figure: SPECweb99 network bandwidth in Mbps (up to ~1,200) for MP4/MP8/MP12 across configurations: 32MB, 64MB, 128MB, 256MB, or 512MB DRAM + 1GB Flash, and 1GB DRAM alone]
Overall Main Memory Power
[Figure: SPECweb99 main-memory power in watts, broken into read, write, and idle power: 2.5W for 1GB DDR2 always active, 1.6W for 1GB DDR2 with powerdown, 0.6W for 128MB DDR2 + 1GB Flash]