Main Memory Systems. Department of Electrical Engineering Stanford University Lecture 5-1


1 Lecture 5 Main Memory Systems Department of Electrical Engineering Stanford University Lecture 5-1

2 Announcements
- If you don't have a group of 3, contact us ASAP
- HW-1 is due on 10/15, 5pm (no extensions, no exceptions)
  - Bring to lecture or drop off in box outside Gates Hall 310
- PA-1 will be out on Thu
  - Discussion session on PA1: Fri 10/10, 11am, Skilling 193
  - START EARLY!

3 Review: Prefetching
- Idea: fetch data into the cache before the processor requests it
  - Can address cold misses
  - Can be done by the programmer, compiler, or hardware
- Characteristics of ideal prefetching
  - You only prefetch data that is truly needed, to avoid wasting bandwidth
  - You issue prefetch requests early enough to hide the memory latency
  - You don't issue prefetch requests too early, to avoid cache pollution

4 Review: Stream Prefetching or Stream Buffers
- Sequential prefetching problem: performance slows down once every N cache lines
- Stream prefetching is a continuous version of sequential prefetching
  - Stream buffer can fit N cache lines
  - On a miss, start fetching N sequential cache lines
  - On a stream buffer hit: move cache line to cache, start fetching line (N+1)
  - In other words, the stream buffer tries to stay N cache lines ahead
- Design issues
  - When is a stream buffer released? When we miss in both the cache and the stream buffer
  - Can use multiple stream buffers to capture multiple streams, e.g. a program operating on 2 arrays
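
The stream buffer behavior described on this slide can be sketched as a small simulation. This is a minimal illustration, not the design from the lecture: the class name `StreamBuffer`, the FIFO head-only lookup, and the unbounded `cache` set are all assumptions made to keep the example short.

```python
from collections import deque


class StreamBuffer:
    """Single stream buffer that tries to stay `depth` cache lines ahead."""

    def __init__(self, depth=4):
        self.depth = depth
        self.lines = deque()    # prefetched line addresses, oldest first
        self.next_line = None   # next line address to prefetch

    def _refill(self):
        # Keep fetching sequential lines until the buffer is full again.
        while len(self.lines) < self.depth:
            self.lines.append(self.next_line)
            self.next_line += 1

    def access(self, line, cache):
        """Return 'cache', 'stream', or 'miss' for one line-address access."""
        if line in cache:
            return "cache"
        if self.lines and self.lines[0] == line:
            self.lines.popleft()   # move the head line into the cache
            cache.add(line)
            self._refill()         # fetch line N+1 to stay `depth` ahead
            return "stream"
        # Miss in both cache and buffer: release the buffer and restart it.
        cache.add(line)
        self.lines.clear()
        self.next_line = line + 1
        self._refill()
        return "miss"


cache = set()                      # assumption: cache never evicts
sb = StreamBuffer(depth=4)
results = [sb.access(line, cache) for line in range(100, 108)]
```

With a purely sequential access pattern, only the first access misses; every later access is served by the buffer, which keeps refilling itself one line at a time.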

5 Stream Buffer Design

6 Strided Prefetching
- Idea: detect and prefetch strided accesses
    for (i=0; i<n; i++) A[i*1024]++;
- Stride detected using a PC-based table
  - For each PC, remember the stride
- Stride detection
  - Remember the last address used for this PC
  - Compare to the currently used address for this PC
  - Track confidence using a two-bit saturating counter
  - Increment when stride correct, decrement when incorrect
- How to use the PC-based table
  - When stream prefetching is initialized, direct it to fetch strided
  - Everything else remains the same
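
The PC-based stride table with a two-bit saturating confidence counter might look like the sketch below. The class name `StridePrefetcher`, the confidence threshold of 2, the retrain-at-zero rule, and the example PC and addresses are assumptions chosen for illustration; the slide only specifies the table contents and the increment/decrement behavior.

```python
class StridePrefetcher:
    """PC-indexed stride table with a 2-bit saturating confidence counter."""

    def __init__(self, threshold=2):
        self.table = {}             # pc -> [last_addr, stride, confidence]
        self.threshold = threshold  # assumed: prefetch once confidence >= 2

    def access(self, pc, addr):
        """Record a load and return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [addr, 0, 0]
            return None
        last_addr, stride, conf = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)   # saturate at 3 (two bits)
        else:
            conf = max(conf - 1, 0)
            if conf == 0:
                stride = new_stride   # retrain only when confidence is gone
        self.table[pc] = [addr, stride, conf]
        if conf >= self.threshold and stride != 0:
            return addr + stride
        return None


pf = StridePrefetcher()
pc = 0x400100                       # hypothetical PC of the load in the loop
# Hypothetical load stream that advances by 4096 bytes per iteration.
prefetches = [pf.access(pc, 0x1000 + i * 4096) for i in range(5)]
```

The first few accesses only train the table; once the same stride has been seen twice in a row, each access triggers a prefetch one stride ahead.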

7 Other Ideas in Prefetching
- Prefetch engines for pointer-based data structures
  - Predict if fetched data contain a pointer & follow it
  - Works for linked lists, graphs, etc.
  - Must be very careful: What is a pointer? How far to prefetch?
- Correlating prefetchers
  - Learn about address correlation (A, B, C always accessed in order)
  - When A is accessed, immediately fetch B & C
  - Can use a PC-based table or a Markov prefetcher
- Pre-execution or run-ahead
  - Distill the part of the program that generates addresses
  - Run this program on another processor/thread to generate prefetches
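
A correlating prefetcher can be sketched as a table that records which address most often follows each address. The code below is a simplified, assumed design (the class name `CorrelationPrefetcher`, the successor counters, and the prediction depth of 2 are illustrative), in the spirit of a Markov prefetcher rather than any specific published implementation.

```python
from collections import Counter


class CorrelationPrefetcher:
    """Learns 'X is usually followed by Y' and follows that chain."""

    def __init__(self):
        self.successors = {}   # addr -> Counter of addresses seen next
        self.prev = None

    def train(self, addr):
        """Observe one access and update the successor counts."""
        if self.prev is not None:
            self.successors.setdefault(self.prev, Counter())[addr] += 1
        self.prev = addr

    def predict(self, addr, depth=2):
        """Return up to `depth` addresses expected to follow `addr`."""
        predicted = []
        current = addr
        for _ in range(depth):
            counts = self.successors.get(current)
            if not counts:
                break
            current = counts.most_common(1)[0][0]
            predicted.append(current)
        return predicted


cp = CorrelationPrefetcher()
for addr in ["A", "B", "C", "A", "B", "C"]:
    cp.train(addr)
to_fetch = cp.predict("A")   # addresses to fetch when A is accessed
```

After seeing the A, B, C pattern, an access to A yields B and C as prefetch candidates, matching the behavior described on the slide.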

8 Today's Menu: Main Memory Systems
- Memory basics
  - DRAM Vs SRAM
- DRAM
  - Basic operation
  - System organization
  - DRAM chip architectures
  - DRAM controller
- How to improve the memory system bandwidth and latency
- Acknowledgements: Bruce Jacob, University of Maryland
  - Extensive research & teaching on modern DRAMs
  - See two optional papers online

9 Computer System (PC) Overview

10 General Memory Background
[Figure: address register; MS bits feed the row decoder and LS bits feed the column decoder of a 2D storage array; data out]
- Read access sequence:
  1. Decode row address & drive word-lines
  2. Selected bits drive bit-lines (entire row read)
  3. Amplify row data
  4. Decode column address & select subset of row (send to output)
  5. Precharge bit-lines (for next access)

11 Memory Terminology
- Access time (latency)
  - Time from issuing an address to data out
- Cycle time
  - Minimum time between two requests (repeat rate)
- Bandwidth
  - Bytes/unit of time we can extract from the memory
  - Peak: ignore initial latency
  - Sustained: include initial latency
- Concurrency
  - Number of accesses executing in a parallel or overlapped manner
  - Can help increase bandwidth or improve latency
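
The difference between peak and sustained bandwidth can be made concrete with a small calculation. The model and all numbers below are assumptions for illustration: one access pays the access time once and then streams fixed-size transfers back to back.

```python
def peak_bandwidth(bytes_per_transfer, transfer_time_ns):
    """Bytes per ns once data is flowing (initial latency ignored)."""
    return bytes_per_transfer / transfer_time_ns


def sustained_bandwidth(total_bytes, bytes_per_transfer,
                        transfer_time_ns, access_time_ns):
    """Bytes per ns for one access, including the initial latency."""
    transfers = total_bytes / bytes_per_transfer
    return total_bytes / (access_time_ns + transfers * transfer_time_ns)


# Hypothetical memory: 8 bytes every 5 ns after a 40 ns access time,
# used to fetch one 64-byte cache line. 1 byte/ns equals 1 GB/s.
peak = peak_bandwidth(8, 5)                      # 1.6 bytes/ns
sustained = sustained_bandwidth(64, 8, 5, 40)    # 64 / (40 + 40) bytes/ns
```

In this example the sustained figure is half of the peak figure because the initial latency takes as long as the data transfer itself.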

12 SRAM vs DRAM
SRAM                          | DRAM
6-transistor storage cell     | 1-transistor + 1-capacitor storage cell
Retains value if power is on  | Requires refresh
Non-destructive reads         | Destructive reads
Cycle time == access time     | Cycle time > access time
Wide interfaces               | Narrower interfaces (4b to 32b)
Typical product today:        | Typical product today:
1-16Mbit, 2-15ns access time  | 64Mb-1Gb, 5-40ns access time, 8-60ns cycle time
[Figure: DRAM cell with word line, capacitor C, bit line, and sense amp]

13 SRAM Vs DRAM: Considerations
- SRAM is preferable for register files & L1/L2 caches
  - Fast access
  - No refreshes
  - Simpler manufacturing
- DRAM is preferable for stand-alone memory chips
  - Much higher capacity: 10x and growing
  - Better immunity to soft errors
  - Latency dominated by board traces anyway
- There is some gray area in the middle

14 DRAM Basic Operation

15 Basic DRAM Operation (1)

16 Basic DRAM Operation (2)

17 Basic DRAM Operation (3)

18 Basic DRAM Operation (4) Not shown: precharge time, refresh time

19 Latency Components: Basic DRAM Operation
- CPU -> controller transfer time
- Controller latency
  - Queuing & scheduling delay at the controller
  - Access converted to basic commands
- Controller -> DRAM transfer time
- DRAM latency
  - Simple CAS if row is open, OR
  - RAS + CAS if array precharged, OR
  - PRE + RAS + CAS (worst case)
- DRAM -> CPU transfer time (through controller)
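
The three DRAM latency cases listed on this slide can be written as a small function. The function name and the 15 ns timing values are hypothetical placeholders, not figures from the lecture; only the case structure (CAS, RAS + CAS, PRE + RAS + CAS) comes from the slide.

```python
def dram_access_latency(open_row, target_row, pre, ras, cas):
    """DRAM-side latency for one access to a bank.

    open_row is the row currently held in the sense amps,
    or None when the bank is precharged.
    """
    if open_row == target_row:
        return cas                 # row already open: column access only
    if open_row is None:
        return ras + cas           # bank precharged: activate, then read
    return pre + ras + cas         # wrong row open: precharge first


# Hypothetical timings in ns, chosen only to make the cases visible.
PRE, RAS, CAS = 15, 15, 15
row_hit = dram_access_latency(open_row=7, target_row=7, pre=PRE, ras=RAS, cas=CAS)
row_closed = dram_access_latency(open_row=None, target_row=7, pre=PRE, ras=RAS, cas=CAS)
row_conflict = dram_access_latency(open_row=3, target_row=7, pre=PRE, ras=RAS, cas=CAS)
```

The total latency seen by the CPU adds the transfer and controller components from the list on top of whichever of these DRAM cases applies.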

20 DRAM Latency Examples
- Often quoted
  - tRC = RAS + PRE
  - tRAC = RAS + CAS
- Faster DRAMs are possible, but are more expensive
  - Non-commodity

21 DRAM DIMMs
- Dual Inline Memory Module (DIMM)
  - A PCB with 8 to 16 DRAM chips
  - All chips receive identical control and addresses
  - Data pins from all chips are directly connected to PCB pins
- Advantages:
  - A DIMM acts like a high-capacity DRAM chip with a wide interface
    - E.g. use 8 chips with 8-bit interfaces to connect to a 64-bit memory bus
  - Easier to replace/add memory in a system
    - No need to solder/remove individual chips
- Disadvantage: memory granularity problem
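
The chip-count arithmetic behind a DIMM, and the granularity problem it causes, can be sketched as follows. The function name and the 512 Mbit chip capacity are assumptions for illustration; the 64-bit bus and 8-bit chip width come from the slide's example.

```python
def dimm_organization(bus_width_bits, chip_width_bits, chip_capacity_mbit):
    """Return (chips per DIMM, DIMM capacity in MBytes)."""
    if bus_width_bits % chip_width_bits != 0:
        raise ValueError("bus width must be a multiple of chip width")
    chips = bus_width_bits // chip_width_bits
    capacity_mbyte = chips * chip_capacity_mbit // 8
    return chips, capacity_mbyte


# Slide example: 8-bit chips on a 64-bit bus (chip capacity is assumed).
x8 = dimm_organization(64, 8, 512)
# Narrower 4-bit chips need twice as many chips to fill the same bus.
x4 = dimm_organization(64, 4, 512)
```

Because every chip on the bus must be present, memory can only grow in whole-DIMM steps (512 MBytes or 1024 MBytes in these two cases), which is one way to read the granularity disadvantage.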

22 Multi-DIMM SDRAM Memory System

23 DRAM Banks
- Banks are independent arrays WITHIN a chip
- DRAMs today have 4 to 32 banks
  - SDRAM/DDR SDRAM system: 4 banks
  - RDRAM system: banks
- Advantages
  - Lower latency
  - Higher bandwidth by overlapping
  - Finer-grain power management
- Disadvantages
  - Bank area overhead
  - More complicated control

24 How Do Multiple Banks Help
[Timing diagram, before: no overlapping, assuming accesses to different DRAM rows. Addresses A0, A1, A2 on the address bus; each access waits for the DRAM before its data (D0, D1) appears on the data bus.]
[Timing diagram, after: overlapped accesses, assuming no bank conflicts. Addresses A0 to A3 on the address bus; the waits for DRAM bank 0 and bank 1 overlap; data D0 to D3 on the data bus.]
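
The effect of bank overlap can be estimated with a toy timing model. Everything here is an assumption made for illustration: accesses are spread round-robin over the banks, each bank stays busy until its data has left on the shared data bus, and the 40 ns access time and 10 ns transfer time are made-up values.

```python
def total_time(num_accesses, num_banks, access_time, transfer_time):
    """Time to finish all accesses with a shared data bus (toy model)."""
    bank_free = [0] * num_banks   # time at which each bank can start again
    bus_free = 0                  # time at which the data bus is free
    for i in range(num_accesses):
        bank = i % num_banks                  # round-robin, no conflicts
        data_ready = bank_free[bank] + access_time
        transfer_start = max(data_ready, bus_free)
        bus_free = transfer_start + transfer_time
        bank_free[bank] = bus_free            # busy until data has left
    return bus_free


one_bank = total_time(4, 1, access_time=40, transfer_time=10)
two_banks = total_time(4, 2, access_time=40, transfer_time=10)
four_banks = total_time(4, 4, access_time=40, transfer_time=10)
```

With one bank the four accesses serialize completely; with four banks the DRAM access times overlap and only the data transfers are serialized on the bus.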

25 DRAM Ranks
- A group of chips that responds to a single command & returns data
  - E.g. half the chips on a two-sided DIMM
- SDRAM/DDR SDRAM system: 4~6 ranks
- RDRAM system: 32 ranks

26 DIMMS Revisited

27 DRAM Channels (Physical & Logical)
- Why more channels? Increase bandwidth
- Cost
  - More board wires
  - More resources in controller (less if single logical channel)
- Multiple physical, one logical channel
  - More overlapping across banks
  - No parallel accesses

28 How Do Multiple Banks/Ranks/Channels Help
[Timing diagram, before: no overlapping, assuming accesses to different DRAM rows. Addresses A0, A1, A2 on the address bus; each access waits for the DRAM before its data (D0, D1) appears on the data bus.]
[Timing diagram, after: overlapped accesses, assuming no bank conflicts. Addresses A0 to A3 on the address bus; the waits for DRAM bank 0 and bank 1 overlap; data D0 to D3 on the data bus.]

29 Address Mapping Examples (aka Address Interleaving)
- What are the tradeoffs?
  - Think about sequential patterns initially
  - What is fast and what is slow in memory accesses?
  - What about non-sequential accesses?
  - Other issues?
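
One way to explore the tradeoffs is to decode the same addresses under two mappings and see where sequential cache lines land. The field widths and the two layouts below are hypothetical examples, not the mappings shown on the slide.

```python
def decode(addr, layout):
    """Split an address into fields.

    layout lists (field_name, bit_count) from least to most significant.
    """
    fields = {}
    for name, bits in layout:
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields


# Hypothetical layouts: 64-byte lines, 4 banks, 128 lines per row.
ROW_INTERLEAVED = [("offset", 6), ("column", 7), ("bank", 2), ("row", 14)]
LINE_INTERLEAVED = [("offset", 6), ("bank", 2), ("column", 7), ("row", 14)]

sequential = [i * 64 for i in range(4)]   # four consecutive cache lines
banks_row = [decode(a, ROW_INTERLEAVED)["bank"] for a in sequential]
banks_line = [decode(a, LINE_INTERLEAVED)["bank"] for a in sequential]
columns_row = [decode(a, ROW_INTERLEAVED)["column"] for a in sequential]
```

Under the first layout, sequential lines stay in one open row of one bank (row hits, no bank parallelism); under the second, they spread across all four banks (bank parallelism, but each bank opens its own row).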

30 DRAM Controllers
- Their role
  - Generate proper controls for DRAM DIMMs for each access
  - Schedule across banks & potentially reorder DRAM accesses
  - Involves queuing & buffering
- Their location
  - In the chipset/memory controller/north bridge
  - In the processor chip: reduces latency & improves BW between CPU & controller
- What makes them complicated
  - Variability of timings across different systems/DRAM chips
  - Ordering requirements
  - Trade-off between latency and bandwidth

31 DRAM Controller Topologies
- Tradeoffs?
- See optional paper for examples (available online)

32 DRAM Controller Scheduling Policies
- Bank precharging: open or closed
  - Open: leave row open until new row request
  - Closed: precharge bitlines as soon as current burst is satisfied
- Power mode
  - Active, stand-by, self-refresh, power-down
- Basic ordering:
  - In-order, load-over-store, bank-ready, age-threshold, ...
  - Remember that ordering matters across banks as well: all banks share the same IO pins
- Advanced ordering:
  - Open row first, row with most pending, row with fewest pending, ...
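
The difference between in-order and open-row-first ordering can be sketched with a tiny scheduler. This is an illustrative model under assumptions: an open-page policy, requests given as hypothetical `(id, bank, row)` tuples, and no timing or fairness constraints.

```python
def schedule(requests, open_row_first):
    """Return (issue order by request id, number of row hits).

    requests is a list of (request_id, bank, row), oldest first.
    """
    queue = list(requests)
    open_rows = {}            # bank -> row currently open in that bank
    order, row_hits = [], 0
    while queue:
        chosen = queue[0]     # default: oldest request (in-order)
        if open_row_first:
            for req in queue:
                if open_rows.get(req[1]) == req[2]:
                    chosen = req      # oldest request that hits an open row
                    break
        queue.remove(chosen)
        _, bank, row = chosen
        if open_rows.get(bank) == row:
            row_hits += 1
        open_rows[bank] = row         # open policy: leave the row open
        order.append(chosen[0])
    return order, row_hits


# Hypothetical request stream: two requests to row 5 of bank 0
# separated by a request to a different row of the same bank.
reqs = [(0, 0, 5), (1, 0, 9), (2, 0, 5), (3, 1, 7)]
in_order = schedule(reqs, open_row_first=False)
reordered = schedule(reqs, open_row_first=True)
```

Reordering lets request 2 reuse the row opened by request 0, at the cost of delaying request 1; this is the latency versus bandwidth trade-off that a real controller bounds with rules such as an age threshold.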

33 DRAM Evolution: SDRAM & DDR
- SDRAM: 1st synchronous DRAM
  - 66 to 133MHz with multiplexed address bus
  - 4 banks
  - Programmable burst (1 to 8)
- DDR SDRAM: double data rate (both clock edges)
  - 100 to 266MHz with multiplexed address bus
  - 4 banks
  - Programmable burst (2 to 8)
- DDR2
  - 200 to 333MHz, 4 banks, 4-8 burst, ...
- Over time: clock, minimum burst, banks, ...
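
The peak data-bus bandwidth implied by these clock rates follows from clock rate, bus width, and whether data moves on one or both clock edges. The sketch below assumes a 64-bit data bus and reports MB/s as 10^6 bytes per second; the function name is illustrative.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_width_bits, double_data_rate):
    """Peak data-bus bandwidth in MB/s (10^6 bytes per second)."""
    transfers_per_clock = 2 if double_data_rate else 1
    return clock_mhz * transfers_per_clock * bus_width_bits // 8


# Assumed 64-bit bus; clock rates taken from the ranges on the slide.
sdram_133 = peak_bandwidth_mb_s(133, 64, double_data_rate=False)
ddr_200 = peak_bandwidth_mb_s(200, 64, double_data_rate=True)
```

These are peak figures only; sustained bandwidth also depends on row hits, bank conflicts, and bus turnarounds.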

34 DDR Vs. Rambus
- DDR: 200MHz 64-bit bus
- Rambus: 800MHz 16-bit bus
  - Many banks/chip (4-32)
  - Narrow fast interconnect (pipelined)
  - High bandwidth
  - Latency & area penalty

35 Other DRAM Options
- GDDRx: DRAM specialized for graphics
  - Unidirectional signaling, higher clock rate, lower tRC, ...
- RLDRAM/FCDRAM: reduced latency / fast cycle DRAM
  - Mostly targeted toward L3 caches & telecom gear
  - Wider bus, low tRC/tRAC, non-multiplexed address bus, small bursts
- ESDRAM: 1T SRAM (SRAM replacement)
  - 16 banks, hidden refresh, 4-6 cycle latency, large bursts
- VCDRAM: virtual channel DRAM
  - Includes a small SRAM cache
- Mobile DRAM
  - Low cost and low power design, hidden refresh

36 Fully Buffered DIMM (FB-DIMM)
- The DDR problem
  - Higher capacity -> more DIMMs -> lower data rate (multidrop bus)
- FB-DIMM approach: use point-to-point links
  - While still using commodity DRAM chips
  - Network with 12-beat packages, separate up/downstream wires

37 Advanced Memory Buffer

38 Fully Buffered DIMM (FB-DIMM)
- Watch out for:
  - Asymmetric upstream/downstream
  - Requires deep channel for maximum bandwidth efficiency
  - Power overhead of current generation AMBs

39 System Level Choices for DRAM

40 What Processor Vendors Are Currently Supporting

41 How to Select a DRAM Architecture
- Don't just make a decision based on specs!
- Bandwidth: measure for your own workload
  - Mix of reads/writes, bursts, locality, strides, ...
  - Different architectures/chips are optimized for different cases
- Latency: typically not critical, but
  - Don't forget other latency contributors (e.g. DRAM controller)
- Cost: pins (board traces), signaling, cost/DRAM bit
- Power: voltage, power modes, ...
- Risk: number of suppliers

42 DRAM Trends to Keep in Mind
- DRAMs: capacity +60%, cost -30% per year
  - 2.5x cells/area, 1.5x die size in ~3 years
- '98 DRAM fabrication line costs $2B
  - DRAM only: density, leakage vs. speed
- Rely on increasing number of computers & memory per computer (60% market)
  - DIMM is the replaceable unit => computers use any generation DRAM
- Commodity, second-source industry => high volume, low profit, conservative
  - Little organization innovation in 20 years
  - Order of importance: 1) Cost/bit 2) Capacity
  - First Rambus: 10x BW, +30% cost => little impact
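
The capacity figures on this slide can be cross-checked with a short compound-growth calculation. It reads the slide as 2.5x cells/area and 1.5x die size per DRAM generation of roughly three years; the three-year period is an assumption about the garbled "in -3 years" text, and the helper name is illustrative.

```python
def annual_rate(total_factor, years):
    """Constant yearly growth rate that compounds to total_factor."""
    return total_factor ** (1.0 / years) - 1.0


per_generation = 2.5 * 1.5                 # density gain x die size gain
implied_yearly = annual_rate(per_generation, 3)

# Quoted trend: +60% capacity per year, compounded over the same 3 years.
quoted_three_years = 1.6 ** 3
```

The two readings are in the same range: 3.75x per generation works out to roughly 55% per year, while +60% per year compounds to about 4.1x over three years, consistent with the usual rule of thumb of about 4x capacity per generation.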

43 Embedded DRAM
- The inevitable: CPU & DRAM integration
  - Embedded DRAM, merged DRAM-logic, intelligent RAM, ...
- Allows for high bandwidth
  - Multiple wide busses, switched interconnect
- Allows for low latency
- Current set of problems
  - Cost and capacity of single chip
- Alternatives
  - MCM packaging
  - 3D packaging

44 Embedded DRAM Example: VIRAM media processor
[Figure: die layout with a MIPS CPU, I/O, a crossbar, and DRAM blocks]
- 125M transistors, 200MHz, 2 Watt
- Embedded DRAM: 13 Mbytes, 8 banks, crossbar, 6.4GB/sec per bank (peak)
- Multimedia CPU / Processor: 4-lane vector processor, 6.4 Gop/sec, 64-bit MIPS core

45 Non-volatile Memory (Flash)
- Storage technology
  - Charge trapped in a floating gate
  - Retains information even without power supply
- Two design alternatives
  - NOR: used primarily for code
    - Better E/W endurance (100K vs 10K), fast reads (100ns), slow writes (10usec)
  - NAND: used primarily for data
    - Smaller cell (~40%), reads and writes are 1usec
- Applications
  - MP3 players, cameras, ...
  - Hard disk replacement
  - Main memory replacement or assist?
