EECS 470, Lecture 17: DRAM Memory. Fall 2007, Prof. Thomas Wenisch.



1 EECS 470 Lecture 17: DRAM Memory. Fall 2007, Prof. Thomas Wenisch. Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin. Slide 1

2 Announcements. Milestone 2 (due Wednesday). HW #5 (due 11/16). Slide 2

3 Readings. For today: H&P 5.3. For Wednesday: H&P , , Slide 3

4 Cache Placement and Address Translation. Physical cache (most systems): CPU -> VA -> MMU -> PA -> physical cache -> physical memory. Translation sits on the fetch critical path, so hit time is longer. Virtual cache (SPARC2's): CPU -> VA -> virtual cache -> MMU -> PA -> physical memory. Translation is off the fetch critical path, but the cache has the aliasing problem and a cold start after each context switch. Virtual caches are not popular anymore because the MMU and CPU can be integrated on one chip. Slide 4

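5 Physically Indexed Cache. The virtual address (n = v + g bits) splits into a virtual page number (VPN, v bits) and a page offset (PO, g bits). For the TLB lookup, the VPN is further split into a tag (v-k bits) and an index (k bits). The TLB returns a physical page number (PPN, p bits), which together with the PO forms the physical address (m = p + g bits). The D-cache splits the physical address into tag (t bits), index (i bits), and block offset (BO, b bits) and returns the data. Slide 5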

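6 Virtual Cache. The L1 D-cache is accessed directly with the virtual address (n = v + g bits), split into tag (t bits), index (i bits), and block offset (BO, b bits). Only on a miss is the address translated: the VPN is split into a TLB tag (v-k bits) and index (k bits), the TLB returns the PPN (p bits), and the resulting physical address (m = p + g bits) is split into tag, index, and BO to access the L2. Slide 6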

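7 Virtual Index, Physical Tag Cache. Parallel access to the TLB and cache arrays: the TLB translates the VPN (tag of v-k bits, index of k bits) into a PPN (p bits) while the D-cache is indexed with the index (i bits) and block offset (BO, b bits) taken from the page offset (g bits). The PPN stored as the cache tag is compared against the PPN from the TLB to determine hit or miss and select the data. How large can a VIPT cache get? Slide 7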

8 Large VIPT Cache. When the cache index (i bits) plus block offset (b bits) exceeds the page offset (g bits), the index borrows a bits from the VPN. The PPN stored as the cache tag is still compared against the PPN from the TLB to determine hit or miss. If two VPNs differ in the a bits but both map to the same PPN, then there is an aliasing problem. Slide 8

9 Virtual Address Synonyms. Two virtual pages that map to the same physical page, either within the same virtual address space or across address spaces (VA1 -> PA <- VA2). Using VA bits as the index, the data for one PA may reside in different sets in the cache! Slide 9

10 Synonym (or Aliasing). When VPN bits are used in indexing, two virtual addresses that map to the same physical address can end up sitting in two cache lines. In other words, two copies of the same physical memory location may exist in the cache, and a modification to one copy won't be visible in the other. If the two VPNs do not differ in the a bits, then there is no aliasing problem. Slide 10

11 Synonym Solutions. Limit cache size to page size times associativity: get the index from the page offset. Search all sets in parallel: a 64K 4-way cache with 4K pages must search 4 sets (16 entries). Slow! Restrict page placement in the OS: make sure index(VA) = index(PA). Eliminate by OS convention: single virtual space, restrictive sharing model. Slide 11
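The size limit and the "search 4 sets" figure above can be checked with a little arithmetic. The following is a minimal sketch; the function names are illustrative and not from the slides, and all sizes are assumed to be powers of two.

```python
def log2_exact(value):
    """Return log2 of a power of two, e.g. 4096 -> 12."""
    assert value > 0 and value & (value - 1) == 0, "expected a power of two"
    return value.bit_length() - 1

def vipt_alias_bits(cache_bytes, ways, page_bytes):
    """Number of index bits that must come from the VPN (the 'a' bits)."""
    way_bytes = cache_bytes // ways  # bytes covered by index + block offset
    return max(0, log2_exact(way_bytes) - log2_exact(page_bytes))

def max_alias_free_cache(page_bytes, ways):
    """Largest cache whose index fits entirely in the page offset."""
    return page_bytes * ways

# The 64K 4-way cache with 4K pages from the slide.
alias = vipt_alias_bits(64 * 1024, ways=4, page_bytes=4 * 1024)
sets_to_search = 1 << alias             # candidate sets a synonym could occupy
entries_to_search = sets_to_search * 4  # 4 ways per set
```

With 4K pages, a 4-way cache stays alias-free up to 16K; beyond that, each extra doubling adds one VPN bit to the index and doubles the number of sets a synonym may live in.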

12 R10000's Virtually Indexed Caches. The 32KB 2-way virtually indexed L1 needs 10 bits of index and 4 bits of block offset, but the page offset is only 12 bits, so 2 bits of the index are VPN[1:0]. The L2 is direct-mapped, physical, and inclusive of L1; VPN[1:0] is appended to the tag of L2. Given two virtual addresses VA and VB that differ in the a bits and both map to the same physical address PA, suppose VA is accessed first so blocks are allocated in L1 and L2. What happens when VB is referenced? 1. VB indexes to a different block in L1 and misses. 2. VB translates to PA and goes to the same block as VA in L2. 3. Tag comparison fails (VA[1:0] != VB[1:0]). 4. L2 detects that a synonym is cached in L1; VA's entry in L1 is ejected before VB is allowed to be refilled in L1. Slide 12
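The four steps above can be mimicked with a toy model. This is a sketch under strong simplifying assumptions, not the R10000 implementation: every address is treated as its own block, capacity and conflict misses are ignored, and the class name and return strings are invented for illustration.

```python
PAGE_OFFSET_BITS = 12  # 4K pages
ALIAS_BITS = 2         # index bits taken from VPN[1:0]

class SynonymFilter:
    """Toy inclusive L1/L2 pair where L2 remembers VPN[1:0] of the L1 copy."""

    def __init__(self):
        self.l1 = {}  # (VPN[1:0], page offset) -> physical address (the tag)
        self.l2 = {}  # physical address -> VPN[1:0] stored alongside the L2 tag

    def access(self, va, pa):
        vpn_low = (va >> PAGE_OFFSET_BITS) & ((1 << ALIAS_BITS) - 1)
        offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
        if self.l1.get((vpn_low, offset)) == pa:
            return "L1 hit"
        cached_low = self.l2.get(pa)
        if cached_low is not None and cached_low != vpn_low:
            # L2 sees the same PA cached under different VPN bits: a synonym.
            # Eject the old L1 copy before refilling under the new index.
            self.l1.pop((cached_low, offset), None)
            result = "synonym evicted"
        else:
            result = "miss"
        self.l2[pa] = vpn_low
        self.l1[(vpn_low, offset)] = pa
        return result

cache = SynonymFilter()
VA, VB, PA = 0x1000, 0x2000, 0x7000  # VA and VB differ only in VPN[1:0]
trace = [cache.access(VA, PA), cache.access(VA, PA), cache.access(VB, PA)]
```

The point of the design is that at most one L1 copy of a physical block exists at any time, so a write through one virtual name can never leave a stale copy under another.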

13 MIPS TLB. 64-entry fully associative unified TLB. Entries are paired: each entry maps 2 consecutive VPNs to 2 different PPNs. The TLB is software managed, with a 7-instruction page table walk in the best case. TLB Write Random chooses a random entry for TLB replacement; the OS can exclude some number of TLB entries (the low range) from the random selection to hold translations that cannot miss or should not miss. R2000 TLB entry fields: VPN (20 bits), ASID, PPN (20 bits), the flags N, D, V, G, and 8 zero bits. N: noncacheable. D: dirty (actually a write-enable bit). V: valid. G: global entry, i.e., ignore ASID matching. Slide 13

14 MIPS Bottom-Up Hierarchical Table. Page table organization is not part of the ISA. The reference design is optimized for software TLB miss handling. On a TLB miss, the hardware traps and automatically generates the virtual address of the PTE by concatenating PTEBase, the VPN of the faulting VA, and low-order zeros. The handler then loads the PTE (PPN and status bits) from memory at that address. Which address space is the PTE address in? Can this load miss? What happens if it misses? Slide 14
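The address generation described above amounts to a shift and an OR. The sketch below assumes 32-bit virtual addresses, 4K pages, 4-byte PTEs, and a table base aligned so that concatenation and addition agree; the field widths and the example base address are illustrative assumptions, not the exact register layout of any MIPS part.

```python
PAGE_OFFSET_BITS = 12  # assumed 4K pages
PTE_BYTES_LOG2 = 2     # assumed 4-byte PTEs, hence the two low-order zeros

def pte_virtual_address(pte_base, faulting_va):
    """Concatenate PTEBase, the faulting VPN, and low-order zeros."""
    vpn = (faulting_va & 0xFFFFFFFF) >> PAGE_OFFSET_BITS
    return pte_base | (vpn << PTE_BYTES_LOG2)

# Hypothetical base: a linear page table mapped at 0xC0000000.
pte_va = pte_virtual_address(0xC0000000, 0x00403ABC)
```

Because the result is itself a virtual address, the load that fetches the PTE goes through the TLB too, which is why the slide asks whether that load can miss.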

15 SPARC Top-Down Hierarchical Table. SPARC V8 (32-bit): top-down, 3-level hierarchical page table for HW MMU page table walk. The context selects a descriptor in the context table, which points to the L1 table (1024 bytes of descriptors, indexed by VA[31:24]); that points to an L2 table (256 bytes of descriptors, indexed by VA[23:18]); that points to an L3 table (256 bytes of PTEs, indexed by VA[17:12]). SPARC V9 (64-bit) defines the Translation Storage Buffer (TSB), a software-managed, direct-mapped cache of PTEs (think inverted/hashed page table). Hardware assists address generation on a TLB miss, e.g., for 8K pages: {TSBbase[63:21], Logic(TSBbase[20:13], VA[32:22], size, split?), VA[21:13], 0000}. The TLB miss handler searches the TSB; if the TSB misses, a slower TSB handler takes over. Slide 15
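The three-level V8 walk uses fixed bit fields of the virtual address, which also explains the table sizes quoted above. A minimal sketch, assuming 4-byte table entries (the function and constant names are illustrative):

```python
ENTRY_BYTES = 4  # assumed size of one descriptor or PTE

def sparc_v8_indices(va):
    """Split a 32-bit VA into (L1 index, L2 index, L3 index, page offset)."""
    l1 = (va >> 24) & 0xFF  # VA[31:24], 256 entries
    l2 = (va >> 18) & 0x3F  # VA[23:18], 64 entries
    l3 = (va >> 12) & 0x3F  # VA[17:12], 64 entries
    offset = va & 0xFFF     # VA[11:0], 4K page
    return l1, l2, l3, offset

l1_table_bytes = (1 << 8) * ENTRY_BYTES
l2_table_bytes = (1 << 6) * ENTRY_BYTES
fields = sparc_v8_indices(0x12345678)
```

An 8-bit index with 4-byte entries gives the 1024-byte L1 table, and the 6-bit indices give the 256-byte L2 and L3 tables.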

16 IBM PowerPC Hashed Page Table. The 40-bit VPN is passed through a hash function and added to the table base to select a group of 8 PTEs in the hashed page table; the table walk is done in hardware. The table must hold at least N PTEs for a system with 2N physical pages. The VPN hashes into a PTE group of 8, and the 8 PTEs are searched sequentially for a tag match. If not found in the first PTE group, search a second PTE group; if not found in the 2nd PTE group, trap to a software handler. The hashed table structure is also used for EA-to-VA mapping in 64-bit implementations. Slide 16
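The two-group search above can be sketched as follows. The hash functions here are hypothetical placeholders (low VPN bits for the primary group, their bitwise complement for the secondary group) chosen only to make the example runnable; they are not claimed to be the PowerPC hash.

```python
GROUP_SIZE = 8  # PTEs per group, as on the slide

class HashedPageTable:
    def __init__(self, num_groups):
        assert num_groups & (num_groups - 1) == 0, "power of two expected"
        self.mask = num_groups - 1
        self.groups = [[] for _ in range(num_groups)]

    def _primary(self, vpn):
        return vpn & self.mask                   # hypothetical hash

    def _secondary(self, vpn):
        return ~self._primary(vpn) & self.mask   # hypothetical second hash

    def insert(self, vpn, ppn):
        for g in (self._primary(vpn), self._secondary(vpn)):
            if len(self.groups[g]) < GROUP_SIZE:
                self.groups[g].append((vpn, ppn))
                return True
        return False  # both groups full: software must evict something

    def lookup(self, vpn):
        for g in (self._primary(vpn), self._secondary(vpn)):
            for tag, ppn in self.groups[g]:  # sequential tag match
                if tag == vpn:
                    return ppn
        return None  # not in either group: trap to the software handler

table = HashedPageTable(num_groups=4)
for i in range(9):                 # nine VPNs that all hash to group 0
    table.insert(4 * i, 100 + i)
```

The ninth mapping overflows the primary group and lands in the secondary one, so finding it costs a second group search.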

17 Static Random Access Memory. An (n+m)-bit address selects the data: n bits decode to one of 2^n row selects, and m bits drive the column mux. The bit-cell array is 2^n rows by 2^m columns (n ≈ m to minimize overall latency), with 2^m differential bitline pairs feeding the sense amps and mux. Read sequence: 1. address decode; 2. drive row select; 3. selected bit cells drive bitlines (entire row is read together); 4. differential sensing and column select (data is ready); 5. precharge all bitlines (for next read or write). How do you write select columns? Access latency is dominated by steps 2 and 3; cycle time is dominated by steps 2, 3, and 5. Step 2 is proportional to 2^m; steps 3 and 5 are proportional to 2^n. The array is usually encapsulated by synchronous (sometimes pipelined) interface logic. Slide 17
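The "n ≈ m" rule follows from the two proportionalities above. The sketch below uses a deliberately crude delay model, assumed for illustration only: step 2 costs 2^m units, step 3 costs 2^n units, and both have the same constant factor.

```python
def best_row_column_split(address_bits):
    """Pick n (row bits) minimizing 2**m + 2**n with n + m fixed."""
    def latency(n):
        m = address_bits - n
        return 2 ** m + 2 ** n  # row-select delay + bitline delay
    n = min(range(address_bits + 1), key=latency)
    return n, address_bits - n

square = best_row_column_split(20)
```

Under this model a 2^20-bit array is fastest as 1024 x 1024; a real design would weight the two terms by the actual wordline and bitline delays.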

18 Dynamic Random Access Memory. The array organization mirrors SRAM: n address bits (strobed by RAS) select one of 2^n row enables, m address bits (strobed by CAS) drive the column mux, and the bit-cell array is 2^n rows by 2^m columns (n ≈ m to minimize overall latency) with 2^m bitlines feeding the sense amps and mux. A DRAM die comprises multiple such arrays. Bits are stored as charge on node capacitance (non-restorative): a bit cell loses charge when read, and a bit cell loses charge over time. Read sequence: 1-3. same as SRAM; 4. a flip-flopping sense amp amplifies and regenerates the bitline, and the data bit is muxed out; 5. precharge all bitlines. A DRAM controller must periodically, either distributed or in a burst, read all rows within the allowed refresh time (10s of ms). (Overall, the DRAM is unavailable for a few percent of the time.) DRAMs use synchronous interfaces and various hacks to allow faster repeated reads to the same row. Slide 18
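The refresh overhead mentioned above is simply the time spent refreshing rows divided by the refresh period. The numbers below (8192 rows, 64 ms refresh period, 100 ns per row refresh) are assumptions picked for illustration, not figures from the slides.

```python
def refresh_overhead(rows, row_refresh_ns, refresh_period_ms):
    """Fraction of time the array is busy refreshing."""
    period_ns = refresh_period_ms * 1_000_000
    return rows * row_refresh_ns / period_ns

def distributed_interval_ns(rows, refresh_period_ms):
    """Spacing between single-row refreshes when they are spread evenly."""
    return refresh_period_ms * 1_000_000 / rows

overhead = refresh_overhead(rows=8192, row_refresh_ns=100, refresh_period_ms=64)
interval = distributed_interval_ns(rows=8192, refresh_period_ms=64)
```

With these assumed parameters the array is busy a little over 1% of the time, and a distributed scheme refreshes one row roughly every 7.8 microseconds; a burst scheme pays the same total cost in one block.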

19 Perspectives. DRAM fabrication is at the forefront of VLSI technology nodes, but DRAM scales with Moore's law in capacity and cost, not speed. Between 1980 and 2002, capacity grew from 64K bit to 512M bit (exponential, ~55% annually), while access time improved only from 250ns to 80ns (linear). But remember, this is a deliberate choice. Memory capacity needs to grow linearly with CPU speed to keep a balanced system (Amdahl). Memory speed is reconciled through cache hierarchies (L1, L2, L3). Slide 19

20 Simple Main Memory. DRAM access takes multiple cycles. What is the miss penalty for a 4-word cache block? Consider these parameters: 1 cycle to send the address, 6 cycles to access each word, 1 cycle to send a word back. Miss penalty = (1 + 6 + 1) x 4 = 32 cycles. How can we speed this up? Make memory and bus wider: read out all words in parallel. Miss penalty for a 4-word block = 1 + 6 + 1 = 8 cycles. Cost: wider bus, larger expansion size, and error correction is harder. Benefit: better bandwidth and latency. Slide 20
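The two miss penalties above come from the same three parameters. A minimal sketch (function names are illustrative):

```python
def miss_penalty_narrow(words, send_addr=1, access=6, transfer=1):
    """One-word-wide memory and bus: every word pays the full round trip."""
    return words * (send_addr + access + transfer)

def miss_penalty_wide(send_addr=1, access=6, transfer=1):
    """Memory and bus as wide as the block: one round trip for all words."""
    return send_addr + access + transfer

narrow = miss_penalty_narrow(words=4)
wide = miss_penalty_wide()
```

The wide organization is 4x faster here only because the block is exactly as wide as the bus; the cost items on the slide are what you pay for that width.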

21 Interleaved Main Memory. Divide memory into M banks and interleave addresses across them so that word A is in bank (A mod M) at word (A div M). With n banks, bank 0 holds words 0, n, 2n, ...; bank 1 holds words 1, n+1, 2n+1, ...; bank 2 holds words 2, n+2, 2n+2, ...; and bank n-1 holds words n-1, 2n-1, 3n-1, .... The physical address splits into three fields: doubleword in bank, bank, and word in doubleword. Interleaved memory can increase memory bandwidth without a wider bus. Use parallelism in memory banks to hide memory latency. Slide 21
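The mapping above is one modulo and one integer division. The sketch also estimates the miss penalty for an interleaved memory under assumed parameters (1 cycle to send the address, 6 cycles of access, 1 cycle per word transferred) and under the assumption that there are at least as many banks as words in the block, so all accesses overlap; that figure is an illustration, not a number stated on this slide.

```python
def bank_and_offset(word_address, num_banks):
    """Word A lives in bank (A mod M) at word (A div M)."""
    return word_address % num_banks, word_address // num_banks

def miss_penalty_interleaved(words, send_addr=1, access=6, transfer=1):
    """All banks access in parallel; words return over a one-word bus."""
    return send_addr + access + words * transfer

placement = [bank_and_offset(a, 4) for a in range(12, 16)]
penalty = miss_penalty_interleaved(words=4)
```

Consecutive words land in consecutive banks, so a sequential block fetch keeps every bank busy while the narrow bus drains one word per cycle.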

22 Independent Memory Banks. Bus bandwidth = 1 word per cycle. Assume 4 banks with a non-pipelined DRAM interface: access time A = 2, cycle time A + B = 4, transfer time T = 1. In the table, a marks an access cycle, b a busy cycle, and t the transfer cycle.
Cycle  Address  Bank 0  Bank 1  Bank 2  Bank 3
1      12       a
2      13       a       a
3      14       b/t     a       a
4      15       b       b/t     a       a
5      16       a       b       b/t     a
6      17       a       a       b       b/t
7      18       b/t     a       a       b
8      19       b       b/t     a       a
9                       b               a
Copyright 2002 Falsafi, from Hill, Smith, Sohi, Vijaykumar, and Wood. Slide 22

23 Independent Banks (Stride of 2). Bank conflict!
Cycle  Address  Bank 0  Bank 1  Bank 2  Bank 3
1      12       a
2      14       a               a
3               b/t             a
4               b               b/t
5      16       a               b
6      18       a               a
7               b/t             a
8               b               b/t
9                               b
Slide 23

24 Independent Banks (Stride of 3).
Cycle  Address  Bank 0  Bank 1  Bank 2  Bank 3
1      12       a
2      15       a                       a
3      18       b/t             a       a
4      21       b       a       a       b/t
5      24       a       a       b/t     b
6      27       a       b/t     b       a
7      30       b/t     b       a       a
8      33       b               a       b/t
9                               b/t     b
Any relatively prime stride would work well. What about random accesses? Slide 24
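The stride behavior in these timing tables can be reproduced with a small simulation. This is a sketch assuming 4 banks, a bank cycle time of 4, at most one new address issued per cycle, and in-order issue that stalls while the target bank is busy; the function name is illustrative.

```python
from math import gcd

def issue_cycles(addresses, num_banks=4, bank_cycle_time=4):
    """Return the cycle (starting at 1) in which each address is issued."""
    bank_free_at = [1] * num_banks  # first cycle each bank can start an access
    next_cycle = 1                  # earliest cycle the next address may issue
    issued = []
    for address in addresses:
        bank = address % num_banks
        start = max(next_cycle, bank_free_at[bank])  # stall on a busy bank
        issued.append(start)
        bank_free_at[bank] = start + bank_cycle_time
        next_cycle = start + 1
    return issued

stride1 = issue_cycles(range(12, 20, 1))
stride2 = issue_cycles(range(12, 28, 2))
stride3 = issue_cycles(range(12, 36, 3))
conflict_free = [s for s in (1, 2, 3) if gcd(s, 4) == 1]
```

Strides 1 and 3 visit all four banks before returning to the first, so one address issues every cycle; stride 2 touches only banks 0 and 2 and stalls two cycles out of every four.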

25 Interleaving Conclusions. Interleaving for sequential accesses: loads cache words; good for write-back caches. Independent banking otherwise. Do both: banks provide interleaving for high bandwidth; superbanks serve multiple cache misses from non-blocking caches and/or multiprocessors. How many banks? Slide 25

26 DRAM Evolution. Survey by Cuppu et al.: 1. Fast Page Mode 2. Extended Data Out 3. Synchronous & Enhanced Synchronous DRAM 4. Double Data Rate 5. RAMLink 6. Rambus & Direct Rambus Slide 26

27 Conventional 64Mbit DRAM. Example from Micron. Slide 27

28 Fast Page Mode (FPM). Timing diagram: RAS and CAS strobes; one row address followed by column addresses, each returning data. One row address, multiple column addresses. Slide 28

29 Extended Data Out (EDO). Timing diagram: RAS and CAS strobes; one row address followed by column addresses, each returning data. As in FPM, but the column address assertion is overlapped with data out. Slide 29

30 Synchronous DRAM (SDRAM). Timing diagram: RAS and CAS strobes; one row address and one column address followed by several data transfers. Single CAS strobe, multiple transfers. Slide 30

31 Enhancements on SDRAM: Enhanced SDRAM & DDR. 1. ESDRAM (Enhanced): overlap row buffer access with refresh. 2. DDR (Double Data Rate): transfer on both clock edges. Slide 31

32 RAMBus & Direct RAMBus. Developed before DDR. Originally used in high-end desktops/servers. Lost popularity after DDR (and due to lawsuits). RAS/CAS bottleneck => eliminate the interface. Packet-switched bus to each DRAM. Transfers on both clock edges. Overlaps requests. Direct RAMBus: wider packet bus; separate data and address buses. Slide 32

33 Power Management Support. Trade-off between time and power. Works well when Napping; the PwrDwn access penalty is prohibitive for high-performance systems. Not currently supported in installed products? State diagram: Active (100x power), Standby (60x power), Nap (10x power), PwrDwn (1x power), with transition delays of 1.1x, 2x, and 100x. Slide 33

34 Non-conventional DRAM: Embedded DRAM or IRAM. Embedded DRAM: logic in DRAM technology. Huge on-chip DRAM bandwidth => compute in DRAM. Used as graphics chips. Can this be used in general-purpose computing? What are the implementation problems? Slide 34
