Background. Memory Hierarchies. Register File. Background. Forecast Memory (B5) Motivation for memory hierarchy Cache ECC Virtual memory.


Bryce Johnston
5 years ago

Memory Hierarchies
EE365 Lecture Notes: Chapter 7

Forecast
  Memory (B5)
  Motivation for memory hierarchy
  Cache
  ECC
  Virtual memory

Background
  Mem Element   Size     Speed     Price
  Register      small    1-5ns     high??
  SRAM          medium   5-25ns    $
  DRAM          large    ns        $5-$10
  Disk          large    10-20ms   $

Background
  Need basic element to store a bit - latch, flip-flop, capacitor
  Memory is logically a 2D array of #locations x data-width
    e.g., 16 registers of 32 bits each is a 16 x 32 memory (4 address bits; 32 bits of data)
    today's main memory chips are 8M x 8 (23 address bits; 8 bits of data)

Register File
  32 FF in parallel => one register
  16 registers
  one 16-way mux per read port
  decode write enable
  can use tri-state and bus for each port
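The address-bit counts quoted for the 16 x 32 register file and the 8M x 8 chip follow from the number of locations. A minimal sketch, assuming the number of locations is a power of two; the function name is made up for illustration:

```python
def address_bits(locations: int) -> int:
    """Address bits needed to select one of `locations` entries."""
    if locations < 1:
        raise ValueError("need at least one location")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, using integers only.
    return (locations - 1).bit_length()


register_file_bits = address_bits(16)            # 16 x 32 register file
dram_chip_bits = address_bits(8 * 2**20)         # 8M x 8 memory chip
print(register_file_bits, dram_chip_bits)
```

The data width (32 or 8 bits) does not affect the address width; it only sets how many bits come back per access.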

SRAM
  Static RAM does not lose data like DRAM
  6T CMOS cell
    pass transistors as switch
    bit lines, word lines
  SRAM interface
  Today - 2M x 8 in 5-15ns
  Typical large implementations (512 x 64) x 8

DRAM
  Dense memory
    1 T cell
    forgets data on read and after a while
  e.g., 16M x 1 in 4k x 4k array
    24 address bits - 12 for row and 12 for column
  Implementation
    write back row to restore destroyed value
    Refresh - in background, march through reading all rows
  Interface reflects internal organization - addr/2, RAS, CAS, data

Optimizations
  Give faster access to some bits of row
    static column - change column address
    page mode - change column address & CAS hit (EDO)
    nibble mode - fast access to 4 bits
  Bigger changes in future
    bandwidth inside >> external bandwidth
    8kb/50ns/chip >> 8b/50ns/chip
    164 Gb/s >> 160 Mb/s (20 MB/s)
    RAMBUS, IRAM, etc

Motivation for Hierarchy
  CPU wants
    memory references/insn * bytes-per-reference * IPC / cycle time
    1.2 * 4 * 1 / 2ns = 2.4 GB/s
  CPU can go only as fast as memory can supply
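The bandwidth figures above can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the function names are assumptions made for illustration, and "8kb" is taken to mean 8 x 1024 bits.

```python
def cpu_bandwidth_demand(refs_per_insn, bytes_per_ref, insns_per_cycle, cycle_time_s):
    """Bytes per second the CPU asks of memory."""
    return refs_per_insn * bytes_per_ref * insns_per_cycle / cycle_time_s


def dram_bandwidth(bits_per_access, access_time_s):
    """Bits per second delivered when `bits_per_access` arrive every access."""
    return bits_per_access / access_time_s


demand = cpu_bandwidth_demand(1.2, 4, 1, 2e-9)   # about 2.4e9 bytes/s
internal = dram_bandwidth(8 * 1024, 50e-9)       # whole row: about 164e9 bits/s
external = dram_bandwidth(8, 50e-9)              # 8 pins: about 160e6 bits/s
print(demand, internal, external)
```

The gap between `internal` and `external` is the roughly thousandfold difference that the "bigger changes in future" bullet points at.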

Motivation for Hierarchy
  Want memory with
    fast access (e.g., one 2 ns CPU cycle)
    large capacity (1 GB)
    inexpensive ($1/MB)
  Incompatible requirements
  Fortunately memory references are not random!

Motivation for Hierarchy
  Locality in time (temporal locality)
    if a datum is recently referenced, it is likely to be referenced again soon
  Locality in space (spatial locality)
    if a datum is recently referenced, neighbouring data is likely to be referenced soon

Motivation for Hierarchy
  E.g., researching term paper - don't look at all books at random
  if you look at a chapter in one book
    temporal - may re-read the chapter again
    spatial - may read neighbouring chapters
  Solution - leave the book on desk for a while
    hit - book on desk
    miss - book not on desk
    miss ratio - fraction not on desk

Motivation for Hierarchy
  Memory access time = access-desk + miss-ratio * access-shelf
  * << 100
  Extend this to several levels of hierarchy
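The desk/shelf formula extends to several levels by applying it from the slowest level inward. A minimal sketch; the function name and all the numbers in the example are hypothetical, chosen only to show the shape of the calculation:

```python
def average_access_time(levels, backing_time):
    """levels: list of (access_time, miss_ratio), fastest level first.

    backing_time is the access time of the last level, which never misses.
    """
    time = backing_time
    # Work from the slowest level toward the fastest one.
    for access_time, miss_ratio in reversed(levels):
        time = access_time + miss_ratio * time
    return time


# One level: desk costs 1, shelf costs 100, 5% of lookups miss the desk.
one_level = average_access_time([(1, 0.05)], 100)
# Two levels: desk, then a nearby bookcase, then the library shelf.
two_level = average_access_time([(1, 0.1), (10, 0.2)], 100)
print(one_level, two_level)
```

With a low miss ratio the average stays close to the desk time, even though the shelf is a hundred times slower.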

Memory Hierarchy
  Small, fast, expensive (per byte) memory
  larger, slower, cheaper memory
  ...
  largest, slowest, cheapest memory
  [Figure: CPU, L1, L2, L3, ..., Ln; each level larger; 20 x faster]

Memory Hierarchy
  Type          Size        Other columns (Speed (ns), Bandwidth, Price ($/MB))
  Register      < 1 KB      >> 100
  Cache         < 512 KB
  Main memory   < 512 MB
  Disk          > 1 GB

Memory Hierarchy
  Registers <-> Main memory: managed by compiler/programmer
    holds expression temporaries
    holds variables - more aggressive register allocation
    spill when needed
    hard!

Memory Hierarchy
  Main memory <-> Disk: managed by
    program - explicit I/O
    operating system - virtual memory
      illusion of larger memory
      protection
      transparent to user

Cache
  cache managed by hardware
  keep recently accessed block
    temporal locality
  break memory into blocks (several bytes)
    spatial locality
  transfer data to/from cache in blocks

Cache
  [Figure: CPU, $ (cache), Main Memory]
  put block in block frame
    state (e.g., valid)
    address tag
    data

Cache
  on memory access
    if incoming tag == stored tag then HIT
    else MISS
      << replace old block >>
      get block from memory
      put block in cache
    return appropriate word within block

Cache Example
  Memory words:
    0x11c   0xe0e0e0e0
    0x120   0xffffffff
    0x124   0x00000001
    0x128   0x00000007
    0x12c   0x00000003
    0x130   0xabababab

Cache Example
  a 16-byte cache block frame:
    state     tag       data
    invalid   0x?????
  lw $4, 0x128
    Is tag 0x120 in cache? (0x128 - 0x128 mod 16 = 0x128 & 0xfffffff0 = 0x120)
    No, get block
      state   tag     data
      valid   0x120   0xffffffff, 0x1, 0x7, 0x3
    Return 0x7 to CPU to put in $4

Cache Example
  lw $5, 0x124
    Is tag 0x120 in cache?
    Yes, return 0x1 to CPU

Cache Example
  Often
    cache 1 cycle
    main memory 20 cycles
  Performance for data accesses with miss ratio 0.1
    mean access = cache access + miss ratio * main memory access
    = * 20 = 1.2
  Typically caches 64K, main memory 64M
    20 times faster
    1/1000 capacity
    but contains 98% of references

Cache
  4 questions
    where is block placed
    how is block found
    which block is replaced
    what happens on a write
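The two loads in the cache example can be replayed with a single block frame. This is a sketch under stated assumptions: the class and method names are made up, the "tag" is the full block address as in the example (0x120), and the memory words at 0x124, 0x128 and 0x12c are taken from the frame contents shown after the miss (0x1, 0x7, 0x3).

```python
BLOCK_BYTES = 16
WORD_BYTES = 4

# Word-addressed memory contents used by the example.
memory = {
    0x11C: 0xE0E0E0E0,
    0x120: 0xFFFFFFFF,
    0x124: 0x1,
    0x128: 0x7,
    0x12C: 0x3,
    0x130: 0xABABABAB,
}


class BlockFrame:
    """One 16-byte cache block frame: state, tag, data."""

    def __init__(self):
        self.valid = False
        self.tag = None
        self.data = []

    def load_word(self, addr):
        """Return (word, hit) for a word-aligned address."""
        tag = addr & ~(BLOCK_BYTES - 1)          # 0x128 -> 0x120
        hit = self.valid and self.tag == tag
        if not hit:
            # Miss: replace the old block with the block fetched from memory.
            self.data = [memory[tag + i] for i in range(0, BLOCK_BYTES, WORD_BYTES)]
            self.tag = tag
            self.valid = True
        # Pick the requested word within the block.
        return self.data[(addr % BLOCK_BYTES) // WORD_BYTES], hit


frame = BlockFrame()
first = frame.load_word(0x128)    # lw $4, 0x128: miss, then returns 0x7
second = frame.load_word(0x124)   # lw $5, 0x124: hit, returns 0x1
print(first, second)
```

The second load hits only because it falls in the same 16-byte block as the first, which is the spatial locality the block transfer is meant to exploit.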

Cache Design
  Simple cache first
    block size = 1 word
    direct-mapped
    16K words (64KB)
    index - 14 bits
    tag - 16 bits
  [Figure labels: Hit, Miss, place?, replace?]

Cache Design
  Build 64K with 16-byte blocks
  What if blocks conflict?
  Fully associative cache
    CAM cells hold D and D' (complement); incoming bits B and B'
    match = AND over i of (B_i * D_i + B_i' * D_i')
  compromise - set associative cache

Cache Design
  3C model
    Conflict
    Capacity
    Compulsory

Cache Design
  Q3. Which block is replaced
    LRU
    random
  Q4. What happens on a write?
    write hit must be slower
    propagate to memory?
      immediately - write-through
      on replacement - write-back
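The index and tag widths for the simple cache follow from its size and block size. A minimal sketch, assuming 32-bit byte addresses and power-of-two sizes; the function name and the `ways` parameter are hypothetical additions used to show how set associativity changes the split:

```python
def cache_fields(cache_bytes, block_bytes, address_bits=32, ways=1):
    """Return (tag_bits, index_bits, offset_bits) for a cache.

    ways=1 is direct-mapped; larger values model a set associative cache.
    """
    sets = cache_bytes // (block_bytes * ways)
    offset_bits = (block_bytes - 1).bit_length()   # byte within block
    index_bits = (sets - 1).bit_length()           # which set
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


simple = cache_fields(64 * 1024, 4)               # 16K one-word blocks
wide_blocks = cache_fields(64 * 1024, 16)         # 64K with 16-byte blocks
four_way = cache_fields(64 * 1024, 16, ways=4)    # same size, 4-way sets
print(simple, wide_blocks, four_way)
```

Going to 16-byte blocks moves two bits from the index to the offset, and each doubling of associativity moves one bit from the index to the tag.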

Cache Design
  Exploit spatial locality
    bigger block size
    may increase miss penalty
  Reduce conflicts
    more associativity
    may increase cache hit time

Cache Design
  Unified vs. split instruction and data cache
  Example
    consider building 16K I and D cache
    or a 32K unified cache
    let t cache be 1 cycle and t memory be 10 cycles

Cache Design
  I and D split cache
    (a) I miss is 5% and D miss is 6%
    75% references are instruction fetches
    t avg = (1 + 0.05*10) * 0.75 + (1 + 0.06*10) * 0.25 = 1.525, about 1.5
  Unified cache
    t avg = 1 + 0.04*10 = 1.4 WRONG!
    t avg = 1.4 + cycles-lost-to-interference
    will cycles-lost-to-interference be < 0.1?
    NOT for modern pipelined processors!

Cache Design
  Multi-level caches
  Many systems today have a cache hierarchy
  E.g.,
    16K I-cache
    16K D-cache
    1M L2-cache
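The split versus unified comparison can be worked through numerically. A sketch only: the helper name is made up, and the 4% unified miss ratio is an assumption, chosen because it is the value that yields the quoted 1.4 with a 1-cycle cache and 10-cycle memory.

```python
def t_avg(miss_ratio, t_cache=1, t_memory=10):
    """Average access time in cycles for one cache."""
    return t_cache + miss_ratio * t_memory


# Split caches: 75% of references are instruction fetches, 25% are data.
split = 0.75 * t_avg(0.05) + 0.25 * t_avg(0.06)

# Unified cache: assumed 4% miss ratio (implied by the 1.4 result).
unified = t_avg(0.04)

# Interference budget: how many cycles per reference the unified cache can
# lose to instruction/data port conflicts before it is no better than split.
budget = split - unified
print(split, unified, budget)
```

The unified figure ignores the cycles lost when an instruction fetch and a data access compete for the single cache, which is why the notes call it wrong for pipelined processors.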

Cache Design
  Why?
    Processors getting faster w.r.t. main memory
    want larger caches to reduce frequency of costly misses
    but larger caches are slower!
  Solution: Reduce cost of misses with a second level cache
    exploits today's technology
    can't put large cache on microprocessor
    board designer can vary cost/performance

CPU and Cache Performance
  Cache only
    miss ratio
    average access time
  Integrate - assume cache hits are part of the pipeline
    Time/prog = insn/prog * cycles/insn * sec/cycle
    CPI = (execution cycles + stall cycles)/insn
    CPI = execution cycles/insn + stall cycles/insn

CPU and Cache Performance
  Stall cycles/insn = read stall cycles/insn + write stall cycles/insn
  read stall cycles/insn = reads/insn * miss ratio * read miss penalty
  write stall cycles/insn = more complex - write through, write back, write buffer?

CPU and Cache Performance
  Example
    CPI with ideal memory is 1.5
    Assume IF and write never stall
    How is CPI degraded if
      loads are 25% of all insns
      loads miss 10% and miss cost is 20 cycles
    CPI = 1.5 + 0.25 * 0.10 * 20 = 2
    2/1.5 = 1.33, i.e. 33% slower
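The CPI example above, written out as a small calculation. The function name is an assumption for illustration; as in the example, instruction fetches and writes are assumed never to stall.

```python
def cpi_with_read_stalls(ideal_cpi, reads_per_insn, miss_ratio, miss_penalty):
    """CPI = execution cycles/insn + read stall cycles/insn."""
    read_stalls = reads_per_insn * miss_ratio * miss_penalty
    return ideal_cpi + read_stalls


ideal = 1.5
cpi = cpi_with_read_stalls(ideal, reads_per_insn=0.25, miss_ratio=0.10, miss_penalty=20)
slowdown = cpi / ideal   # ratio of execution times at equal clock and insn count
print(cpi, slowdown)
```

Because instruction count and cycle time are unchanged, the ratio of CPIs is also the ratio of run times.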

Main Memory
  Each memory access
    1 cycle address
    5 cycle DRAM (really 10+)
    1 cycle data
  4 word cache block

Main Memory
  (a = address cycle, d = DRAM cycle, b = data transfer cycle)
  one word wide:
    a dddddb dddddb dddddb dddddb
    1 + 4*(5+1) = 25
  Four word wide:
    a dddddb
    1 + 5 + 1 = 7
  Interleaved (pipelined)
    a ddddd b
      ddddd  b
      ddddd   b
      ddddd    b
    1 + 5 + 4*1 = 10

Error Correcting Codes (ECC)
  Read ECC stuff in Appendix B
  Assume small number of random errors - bit(s) get flipped
  So in 1 word, by likelihood
    no errors > single error > two errors > more than 2 errors
  Detection - signal a problem
  Correction - restore data to correct value
  Most common
    Parity - single error detection
    SECDED - single error correction; double bit detection

ECC
  Power     correct       #bits   comments
  nothing   0, 1          1
  SED       00, 11        2       01, 10 detect errors
  SEC       000, 111      3       001, 010, 100 => 000
                                  110, 101, 011 => 111
  SECDED    0000, 1111    4       one 1 => 0000
                                  two 1's => error
                                  three 1's => 1111
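The three miss-penalty counts for a 4-word block can be reproduced from the per-access costs (1 address cycle, 5 DRAM cycles, 1 data cycle). A sketch under stated assumptions: the function name is made up, the wide memory is assumed to be as wide as the block, and the interleaved memory is assumed to have at least one bank per word.

```python
def miss_penalty(words, organization, addr=1, dram=5, xfer=1):
    """Cycles to fetch a block of `words` words from main memory."""
    if organization == "narrow":
        # One word wide: every word pays a DRAM access plus a transfer.
        return addr + words * (dram + xfer)
    if organization == "wide":
        # Block-wide memory and bus: one DRAM access, one transfer.
        return addr + dram + xfer
    if organization == "interleaved":
        # Banks access in parallel; words cross the one-word bus in turn.
        return addr + dram + words * xfer
    raise ValueError(f"unknown organization: {organization}")


penalties = {org: miss_penalty(4, org) for org in ("narrow", "wide", "interleaved")}
print(penalties)
```

Interleaving gets most of the benefit of the wide organization while keeping a one-word bus.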

ECC
  For SECDED
    # 1's    result
    0, 1     0
    2        error
    3, 4     1
  Hamming distance
    no. of changes to convert one code to another
  All legal SECDED codes must be at Hamming distance 4

ECC
  Reduce overhead by doing codes on word, not bit
    # bits   SED overhead   SECDED overhead
    1        1 (100%)       3 (300%)
    32       1 (3%)         7 (22%)
    64       1 (1.6%)       8 (13%)
    n        1 (1/n)        1 + log2 n + a little

ECC
  64 bits data, 8 bits check
    dddd...d ccccccc
  use eight by 9 SIMMs = 72 bits
  Intuition
    one check bit is parity
    other check bits point to
      error in data
      error in all check bits
      no error

ECC
  To store
    use data 0 to compute check 0
    store data 0 and check 0
  To load
    read data 1 and check 1
    use data 1 to compute check 2
    syndrome = check 1 xor check 2
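The SECDED column of the overhead table matches the usual Hamming-code count. A minimal sketch, assuming the standard condition that r check bits can correct a single error in n data bits when 2^r >= n + r + 1, with one extra overall parity bit for double-error detection; the function names are made up for illustration.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC check bits plus one overall parity bit for double error detection."""
    return sec_check_bits(data_bits) + 1


for n in (1, 32, 64):
    check = secded_check_bits(n)
    print(n, check, f"{100 * check / n:.1f}%")
```

The check-bit count grows roughly as log2 n, so the relative overhead falls as the protected word gets wider.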

Virtual Memory
  Basic idea
    move data between disk and main memory
    like caches move data to/from main memory
  But the miss penalty for the first byte is 1M cycles, far more than for a cache miss
    therefore engineered differently
  later, we will return to the 4 questions

Virtual Memory
  Blocks are called pages
    typically 4K-16K
    fixed size per system
  Picture

Virtual Memory
  Architecture presents programs with a simple view
    memory addressed with 32-bit addresses
    lw $1, 0x => 0x is the virtual address
  system maps VA to physical address (PA)
    0x -> 0xF028 (page 15, offset 0x28 for 4K page)
  someone else and I run unrelated programs
    each lw $1, 0x
    VA must map to different PA

Virtual Memory
  Thus, VM allows a program to
    use more memory than the system physically has
    think it is the only program running in memory
    think it always starts at address 0x0
    be protected from rogue programs
    start running when most of the program is still on disk
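The "page 15, offset 28" reading of 0xF028 comes from splitting the address at the page-size boundary. A small sketch assuming 4K pages; the function name is made up, and the offset is printed in hexadecimal to match the notes.

```python
PAGE_BYTES = 4 * 1024
OFFSET_BITS = 12          # log2(4096)


def split_address(addr):
    """Return (page number, offset within page) for 4K pages."""
    return addr >> OFFSET_BITS, addr & (PAGE_BYTES - 1)


page, offset = split_address(0xF028)
print(page, hex(offset))
```

Translation changes only the page number; the offset passes through unchanged, so a virtual address that maps to 0xF028 also ends in 0x028.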

Virtual Memory
  A VA miss is called a page fault
    an exception that saves the PC
    OS gains control and initiates disk access
    OS usually runs someone else in the meantime
    interrupt when disk access is complete
    original instruction restarts
  Unlike cache misses, why is OS used to handle a page fault?

Address Translation
  VA -> PA
  E.g., 4K pages
    Use page tables of 4B PTEs
    index with virtual page number
    address of PTE = PTBR + page number * 4

Address Translation
  PTE contains
    page frame number
    valid bit
    protection bits
  Each program has own PT; switch by changing PTBR
  VM causes 100% overhead - 2 memory accesses - PTE + data!
  What to do? temporal and spatial locality

Translation Buffer
  Translation (Lookaside) Buffer
    a cache of translations
    valid   tag     data
    valid   page#   page frame#, rest of PTE ?
  could make fully associative / set associative / direct mapped
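Translation through a page table with a TLB in front can be sketched as below. This is an illustration under assumptions, not a model of any real MMU: Python dictionaries keyed by virtual page number stand in for the page table and the TLB, all names are made up, and the virtual address 0x3028 is a hypothetical example chosen to map to 0xF028.

```python
PAGE_BYTES = 4096
OFFSET_BITS = 12


class PageFault(Exception):
    """Raised when the page table has no valid entry for the page."""


def translate(va, page_table, tlb):
    """Map a virtual address to a physical address, filling the TLB on a miss."""
    vpn, offset = va >> OFFSET_BITS, va & (PAGE_BYTES - 1)
    if vpn in tlb:
        frame = tlb[vpn]                      # TLB hit: no page table access
    else:
        pte = page_table.get(vpn)             # extra memory access for the PTE
        if pte is None or not pte["valid"]:
            raise PageFault(hex(va))          # OS would take over here
        frame = pte["frame"]
        tlb[vpn] = frame                      # cache the translation
    return (frame << OFFSET_BITS) | offset


page_table = {3: {"valid": True, "frame": 15}}   # hypothetical mapping
tlb = {}
pa = translate(0x3028, page_table, tlb)
print(hex(pa), tlb)
```

A second access to the same page finds the translation in `tlb` and skips the page table, which is how the TLB removes most of the 100% overhead.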

Example
  64 entries, FA, maps 64 * 4K = 256 KB
  Virtual address caches are also possible
    faster but synonym problem

Virtual Memory
  4 Questions
  Where is a page placed
    fully associative - any page in any frame
  How is a page found
    not associative search but indirection through PT
  On context switch
    change PTBR
    either flush TLB or add PIDs to TLB tags

Virtual Memory
  Which page is replaced
    approx LRU
    clock
    use page reference bit
  What happens on a write
    write-back
    use page dirty bit

Protection
  User VAs map to different PAs - no overlap
  But may want sharing
    user-user
    user-kernel (mode bit, syscall interface)
  In PTE and TLB entry
    invalid (had before)
    read-only
    read-write (had before)
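The clock scheme mentioned above approximates LRU using one reference bit per page frame. A minimal sketch of one common formulation; the class name and the access sequence are made up for illustration, and real systems set the reference bit in hardware rather than in a method call.

```python
class ClockReplacer:
    """Approximate LRU: a hand sweeps the frames, giving referenced pages a second chance."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames        # page held by each frame
        self.referenced = [False] * num_frames  # page reference bits
        self.hand = 0

    def access(self, page):
        """Touch `page`; return the page evicted to make room, or None."""
        if page in self.pages:
            self.referenced[self.pages.index(page)] = True
            return None
        # Sweep: clear reference bits until an unreferenced frame is found.
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        self.pages[self.hand] = page
        self.referenced[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return victim


replacer = ClockReplacer(3)
evicted = [replacer.access(p) for p in [1, 2, 3, 1, 4, 2, 5]]
print(evicted, replacer.pages)
```

In the last step page 2 survives and page 3 is evicted, because page 2 was referenced after the hand last cleared its bit.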

Page Table Size
  How big is the PT?
    2^32 / 4K * 4 = 4M per program
  To make smaller
    define a limit register
    do limit registers for a few regions - stack, heap
    page a part of PT (terminate recursion)
    Segmented VA (noncontiguous alloc, segment table -> PT)
    use hash table to map PA-VA - called inverted PT

More Optimizations
  Non-blocking caches
    handle hits under misses
  Interleaved/banked caches
    multiple requests simultaneously (poor-man's multiporting)
  Write Buffers
    miss penalty of dirty blocks
  Out-of-order CPU
    tolerate cache hit and miss latencies

More Optimizations
  Compiler optimizations
    get rid of memory accesses (register allocation, reuse)
    improve locality (blocking, tiling)
    insert prefetch
    code scheduling

Real Stuff
  DEC Alpha (550 MHz)
    4 way out-of-order CPU pipeline
  L1 cache
    2 loads/stores per cycle (phase pipelined)
    3 cycles hit latency, 8+ GB/s bandwidth
  L2 cache
    12 cycle hit latency, 4+ GB/s bandwidth
  System interface
    64 bit bus, 80 cycle latency, 2+ GB/s bandwidth
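The 4M page table size is the number of virtual pages times the PTE size. A short sketch of that arithmetic for a flat, single-level table; the function name and the 8K-page comparison are additions for illustration.

```python
def page_table_bytes(va_bits=32, page_bytes=4 * 1024, pte_bytes=4):
    """Size of a flat page table: one PTE per virtual page."""
    virtual_pages = 2 ** va_bits // page_bytes
    return virtual_pages * pte_bytes


flat_4k = page_table_bytes()                       # 2^20 PTEs of 4 bytes
flat_8k = page_table_bytes(page_bytes=8 * 1024)    # larger pages, fewer PTEs
print(flat_4k // 2**20, "MB", flat_8k // 2**20, "MB")
```

Every program pays this cost for a flat table whether or not it uses its whole address space, which motivates the limit-register, paged and inverted schemes listed above.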

Real Stuff
  Charac      Pentium Pro        PowerPC
  VA          32 bits            52 bits
  PA          32 bits            32 bits
  Page size   4 KB, 4 MB         4 KB, selectable, 256 MB
  TLB         split I and D      split I and D
              4-way assoc        2-way assoc
              pseudo random      LRU
              I - 32, D - 64     I - 128, D - 128
              TLB miss H/W       TLB miss H/W

Real Stuff
  Charac      Pentium Pro        PowerPC
  cache       split I and D      split I and D
  size        8 KB each          16 KB each
  assoc       4-way              4-way
  replace     approx LRU         LRU
  block       32 bytes           32 bytes
  write       write-back         write-back or write-through

Summary
  Temporal and spatial locality, memory hierarchy
  Cache design - block size, associativity, write back/through
  Multilevel cache hierarchies
  Virtual memory, translation (VA -> PA), page table (PT)
  VM design - page size, FA through PT, reference bit, dirty bit
  Fast translations - TLB
  Protection, page faults (exceptions)

Summary
  4 Questions - cache, VM, TLB
  Where can a block be placed
    one (DM), a few (SA), any (FA)
  How is a block found
    indexing (DM), search (SA/FA), table lookup (PT)
  What is replaced on a miss
    LRU or random
  How are writes handled
    write through or write back; write back for VM

Main Memory (Fig. 7.13)
  [Figure: three main memory organizations, each drawn as CPU, cache, bus, memory.
   One panel shows a single memory;
   b. Wide memory organization (adds a multiplexor);
   c. Interleaved memory organization (memory banks 0 to 3)]