An introduction to SDRAM and memory controllers (5kk73)


Presentation Outline (part 1)
- Introduction to SDRAM
- Basic SDRAM operation
- Memory efficiency
- SDRAM controller architecture
- Conclusions
Followed by part 2

Static RAM (SRAM)
- SRAM is typically on-chip memory
  - Found in higher levels of the memory hierarchy: caches and scratchpads
  - Either local to a processor or centralized
    - Local memory has very short access time
    - Centralized shared memories have intermediate access time
- An SRAM cell consists of six transistors
  - Limits memory to a few megabytes, or even less


Dynamic RAM (DRAM)
- DRAM was patented in 1968 by Robert Dennard at IBM
- Significantly cheaper than SRAM (power / area)
  - 1 transistor and 1 capacitor vs. 6 transistors for SRAM
  - A bit is represented by a high or low charge on the capacitor
  - Charge dissipates due to leakage, hence the term dynamic RAM
  - Capacity of up to a gigabyte per chip
- DRAM is (shared) off-chip memory
  - Long access time compared to SRAM
  - Off-chip pins are expensive in terms of area and power
  - SDRAM bandwidth is scarce and must be efficiently utilized
- Found in lower levels of the memory hierarchy
  - Used as remote high-volume storage

The DRAM Evolution
- Evolution of the DRAM design over the past 16 years:
  - A clock signal was added, making the design synchronous (SDRAM)
  - The data bus transfers data on both the rising and falling edge of the clock (hence Double Data Rate, DDR)
  - The second and third generations of DDR memory (DDR2/DDR3) scale to higher clock frequencies (up to 1066 MHz)
  - DDR4 is now standardized by JEDEC (up to 1200 MHz; 1600 MHz is planned)
- Special branches of DDR memory exist for graphics cards (GDDR) and for low-power systems (LPDDR, LPDDR2, LPDDR3)

Comparing the DRAM Evolution to the Dennard Evolution

SDRAM Architecture
- The SDRAM architecture is organized in banks, rows and columns
  - A row buffer stores the currently active (open) row
- The memory interface has a command bus, an address bus, and a data bus
  - Buses are shared between all banks to reduce the number of off-chip pins
- A bank is essentially an independent memory, but with shared I/O
- Typical values for DDR2/DDR3:
  - 4 or 8 banks
  - 8K-65K rows / bank
  - 1K-2K columns / row
  - 4, 8, or 16 bits / column
  - 200-800 MHz
  - 32 MB-1 GB density
- Example memory: 16-bit DDR3-1600, 64 MB
  - 8 banks
  - 8K rows / bank
  - 1024 columns / row
  - 16 bits / column
  - 3200 MB/s peak bandwidth
[Figure: a bank with its row buffer; activate opens a row into the buffer, read and write access it, precharge closes it]

An Analogy
An 8-bank SDRAM is like an 8-lane highway with a dead end and a single exit, shared by 2-way traffic!

Presentation Outline
- Introduction to SDRAM
- Basic SDRAM operation
- Memory efficiency
- SDRAM controller architecture
- Conclusions

Basic SDRAM Operation
- The requested row is activated and copied into the row buffer of the bank
- Read bursts and/or write bursts are issued to the active row
  - Programmed burst length (BL) of 4 or 8 words
- The row is precharged and stored back into the memory array

Command       Abbr  Description
Activate      ACT   Activate a row in a particular bank
Read          RD    Initiate a read burst to an active row
Write         WR    Initiate a write burst to an active row
Precharge     PRE   Close a row in a particular bank
Refresh       REF   Start a refresh operation
No operation  NOP   Ignores all inputs

Timing Constraints
- Timing constraints determine when commands can be scheduled
  - More than 20 constraints, some of them inter-dependent
- They limit the efficiency of memory accesses
  - Wait for precharge, activate and read/write commands before data appears on the bus
- Timing constraints get increasingly severe for faster memories
  - The physical design of the memory core has not changed much
  - The constraints in nanoseconds stay roughly constant, but the clock period gets shorter

Parameter                 Abbr.  Cycles
ACT to RD/WR              tRCD   3
ACT to ACT (diff. banks)  tRRD   2
ACT to ACT (same bank)    tRAS   12
Read latency              tRL    3
RD to RD                  -      BL/2

December 8, 2014
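To make the idea concrete, here is a minimal sketch of a timing-constraint check, using the cycle counts from the table above (tRCD = 3, tRRD = 2, tRAS = 12, RD-to-RD = BL/2 = 4 for burst length 8). The dictionary layout and function name are illustrative assumptions, not part of any real controller.

```python
# Timing values from the table above. A real controller tracks many more
# constraints and the most recent command of each type per bank; this sketch
# only checks a single previous command.
CONSTRAINTS = {
    ("ACT", "RD", "same"): 3,    # tRCD: row must be open before a read
    ("ACT", "WR", "same"): 3,    # tRCD also applies to writes
    ("ACT", "ACT", "diff"): 2,   # tRRD: activates to different banks
    ("ACT", "ACT", "same"): 12,  # tRAS (per the table: ACT to ACT, same bank)
    ("RD", "RD", "any"): 4,      # BL/2: a burst occupies the data bus 4 cycles
}

def earliest_issue(prev_cmd, prev_cycle, prev_bank, cmd, bank):
    """Earliest cycle at which `cmd` to `bank` may follow `prev_cmd`."""
    scope = "same" if bank == prev_bank else "diff"
    delay = CONSTRAINTS.get((prev_cmd, cmd, scope),
                            CONSTRAINTS.get((prev_cmd, cmd, "any"), 1))
    return prev_cycle + delay

earliest_issue("ACT", 0, 0, "RD", 0)   # -> 3 (tRCD)
earliest_issue("ACT", 0, 0, "ACT", 1)  # -> 2 (tRRD, different bank)
```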

Bank Parallelism
- Multiple banks provide parallelism
  - SDRAM has separate data and command buses
  - Activates, precharges and data transfers can proceed in parallel (bank preparation)
  - This increases efficiency
- The figure shows parallel memory bursts with burst length 8

Presentation Outline
- Introduction to SDRAM
- Basic SDRAM operation
- Memory efficiency
- SDRAM controller architecture
- Conclusions

Memory Efficiency
- Memory efficiency is the fraction of clock cycles with data transfer
  - It defines the exchange rate between peak bandwidth and net bandwidth
  - Net bandwidth is the actual useful bandwidth after accounting for overhead
- Five categories of memory efficiency for SDRAM:
  - Refresh efficiency
  - Read/write efficiency
  - Bank efficiency
  - Command efficiency
  - Data efficiency
- Memory efficiency is the product of these five categories
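The product structure can be sketched in a few lines. Only the formula (net = peak times the product of the five categories) comes from the slide; the per-category numbers in the example are made up for illustration.

```python
# Net bandwidth as peak bandwidth times the product of the five efficiency
# categories. The example efficiencies below are illustrative assumptions.
def net_bandwidth(peak_mb_s, refresh, read_write, bank, command, data):
    return peak_mb_s * refresh * read_write * bank * command * data

# e.g. the 3200 MB/s peak of the 16-bit DDR3-1600 example memory
nb = net_bandwidth(3200, 0.97, 0.80, 0.80, 0.95, 0.75)  # about 1415 MB/s
```

Note how multiplying five individually reasonable-looking factors already cuts the usable bandwidth by more than half.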

Refresh Efficiency
- SDRAM needs to be refreshed regularly to retain data
  - Each DRAM cell contains a leaking capacitor
  - A refresh command must be issued every 7.8 μs for DDR2/DDR3/DDR4
  - All banks must be precharged first
  - Data cannot be transferred during refresh
- Refresh efficiency is largely independent of traffic: generally 95-99%
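One common way to model this is as 1 - tRFC/tREFI, where tREFI is the average refresh interval (7.8 μs per the slide) and tRFC is the time one refresh operation blocks the memory. The tRFC value used below (127.5 ns, typical for a 1 Gb DDR3 die) is an assumption for illustration.

```python
# Refresh efficiency as the fraction of time NOT spent refreshing.
# tREFI = 7.8 us from the slide; tRFC = 127.5 ns is an assumed device value.
def refresh_efficiency(t_rfc_ns, t_refi_ns=7800.0):
    return 1.0 - t_rfc_ns / t_refi_ns

eff = refresh_efficiency(127.5)  # ~0.98, inside the 95-99% range quoted above
```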

Read / Write Efficiency
- The data bus of an SDRAM is bi-directional
  - Cycles are lost when switching the direction of the data bus
  - Extra NOPs must be inserted between read and write commands
  - "I should have inserted more NOPs!"
- Read/write efficiency depends on traffic
  - Determined by the frequency of read/write switches
  - Switching too often has a significant impact on memory efficiency
  - Switching after every burst of 8 words gives 57% r/w efficiency with DDR2-400
- How would you address this if you designed a memory controller?
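A rough model of this effect: each burst of BL words occupies BL/2 data-bus cycles (double data rate), and every direction switch costs some turnaround cycles. The `switch_penalty` of 3 below is an assumption, chosen so that switching after every 8-word burst reproduces the 57% figure quoted for DDR2-400; both parameters are illustrative.

```python
# Read/write efficiency when the bus direction switches every
# `bursts_per_switch` bursts. switch_penalty (3 cycles) is an assumed
# average turnaround cost, not a datasheet value.
def rw_efficiency(bursts_per_switch, burst_len=8, switch_penalty=3):
    data_cycles = bursts_per_switch * burst_len // 2
    return data_cycles / (data_cycles + switch_penalty)

rw_efficiency(1)   # 4 / 7, about 0.57: switch after every burst
rw_efficiency(10)  # about 0.93: grouping reads and writes recovers efficiency
```

This is one answer to the slide's question: a controller that groups reads with reads and writes with writes amortizes each turnaround over many bursts.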

Bank Efficiency
- A bank conflict occurs when a read or write targets an inactive row (row miss)
  - Significantly impacts memory efficiency
  - Requires a precharge followed by an activate
  - Less than 40% bank efficiency if every access is a row miss in the same bank
- Bank efficiency depends on traffic
  - Determined by the address of the request and the memory map
- How would you address this if you designed a memory controller?
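A back-of-the-envelope sketch of the worst case: successive activates to one bank are spaced at least tRC apart, so each BL/2-cycle data burst occupies a full tRC-cycle window. The tRC value of 20 is taken from the LPDDR2-667-like timing table used in the scheduling exercise; the simplification to a single constraint is an assumption.

```python
# Worst-case bank efficiency when every access is a row miss to the same
# bank: one BL/2-cycle burst per tRC-cycle ACT-to-ACT window.
def bank_efficiency_all_misses(burst_len=8, t_rc=20):
    data_cycles = burst_len // 2
    return data_cycles / max(data_cycles, t_rc)

bank_efficiency_all_misses()  # 4 / 20 = 0.2, well under the 40% bound above
```

Spreading row misses over different banks (as the memory map can arrange) hides the precharge/activate latency behind other banks' data transfers.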

Command Efficiency
- The command bus uses a single data rate
  - It becomes congested if two commands are required simultaneously
  - One command has to be delayed, which may delay data on the bus
- Command efficiency depends on traffic
  - Small bursts reduce command efficiency, as potentially more activate and precharge commands are issued
- Generally quite high (95-100%)

Be a Memory Command Scheduler! (Worst Halloween costume ever)
Pen and paper may be useful. For this SDRAM (loosely based on an LPDDR2-667 memory):

Parameter                 Abbr.  Cycles
ACT to RD/WR (same bank)  tRCD   6
ACT to ACT (diff. banks)  tRRD   5
ACT to ACT (same bank)    tRC    20
RD to RD, or WR to WR     -      4
RD to WR                  -      8
WR to RD                  -      8
RD to PRE (same bank)     -      5
WR to PRE (same bank)     -      12
PRE to ACT (same bank)    tRP    5

Schedule these bursts:
a) RD to bank 0, row 0, column 8
b) RD to bank 0, row 1, column 16
c) WR to bank 0, row 0, column 0
d) WR to bank 3, row 0, column 8
e) WR to bank 0, row 1, column 32
f) RD to bank 0, row 0, column 0

You can reorder all you like. Optimize for the total schedule length.

A possible solution

[Schedule timeline, command and row/column per cycle: cycles 0-16: ACT bank 0 (row 0), ACT bank 3 (row 0), WR bank 0 (col 0), WR bank 3 (col 8); cycles 19-35: RD bank 0 (col 8), RD bank 0 (col 0), PRE bank 0, ACT bank 0 (row 1); cycles 39-51: WR bank 0 (col 32), RD bank 0 (col 16)]

A possible solution
- Useful techniques:
  - Grouping reads and writes
  - Pipelining operations to different banks
  - Overlapping timing constraints (e.g. waiting for both the WR-to-RD and the WR-to-PRE constraint at the same time)
- (Reordering is not always an option in reality)
- These concepts will be revisited later in this lecture
- Normally we build hardware to schedule these commands

Data Efficiency
- A memory burst cannot access segments smaller than the minimum burst length
  - Minimum access granularity: a burst length of 8 words is 16 B with a 16-bit memory and 64 B with a 64-bit memory
  - Excess data is thrown away!
- If data is poorly aligned, an extra segment has to be transferred
  - Cycles are lost transferring unrequested data
- Data efficiency depends on the memory client (and the application)
  - Smaller requests reduce data efficiency
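The alignment effect is easy to compute: round the request down and up to the access granularity and compare useful bytes with transferred bytes. The byte addresses and sizes below are illustrative, using the slide's 16 B granularity (16-bit memory, BL = 8).

```python
# Data efficiency of a single request given the minimum access granularity
# (burst length x interface width, e.g. 8 x 2 B = 16 B for a 16-bit memory).
def data_efficiency(addr, size, burst_len=8, width_bytes=2):
    gran = burst_len * width_bytes              # minimum access granularity
    start = (addr // gran) * gran               # align start downwards
    end = -(-(addr + size) // gran) * gran      # align end upwards (ceil)
    return size / (end - start)                 # useful / transferred bytes

data_efficiency(0, 16)  # 1.0: aligned request, exactly one segment
data_efficiency(8, 16)  # 0.5: misaligned, two segments transferred
```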

Conclusions on Memory Efficiency
- Memory efficiency is highly dependent on traffic
- Worst-case efficiency is very low
  - Every burst targets a different row in the same bank
  - Read/write switch after every burst
  - Results in less than 31% efficiency for all DDR2/DDR3/LPDDR/LPDDR2 memories
- Efficiency drops as memories become faster (DDR4)

Worst-case memory efficiency

[Bar chart: worst-case efficiency (0-1) for 128MB_DDR2-400, 128MB_DDR2-800, 128MB_DDR3-800, 128MB_DDR3-1600, 128MB_LPDDR-266, 128MB_LPDDR-400, 256MB_LPDDR-416, 256MB_LPDDR2-667-S4, 256MB_LPDDR2-800-S4, 256MB_LPDDR2-1066-S4]

- DDRX-Y runs at Y/2 MHz command rate
- Transports Y memory words per cycle
- Conclusion: worst-case efficiency must be avoided!
- (And what is wrong with this picture?)

Presentation Outline
- Introduction to SDRAM
- Basic SDRAM operation
- Memory efficiency
- SDRAM controller architecture
- Conclusions

A general memory controller architecture
- A general controller architecture consists of two parts
- The front-end
  - buffers requests and responses per requestor
  - schedules one (or more) requests for memory access
  - is independent of the memory type
- The back-end
  - translates scheduled request(s) into an SDRAM command sequence
  - is dependent on the memory type

Front-end arbitration
- The front-end provides buffering and arbitration
- The arbiter can schedule requests in many different ways
  - Priorities are common, to give low-latency access to critical requestors (e.g. a stalling processor waiting for a cache line)
  - However, it is important to prevent starvation of low-priority requestors
  - It is common to schedule fairly in case of multiple processors (round-robin, TDM)
- The next request may be scheduled before the previous one is finished
  - This gives more options to the command generator in the back-end
- Scheduled requests are sent to the back-end for memory access
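A round-robin front-end can be sketched in a few lines: each requestor has a queue, and the arbiter grants the next non-empty queue after the last grant, so no requestor starves. The class and method names are illustrative, not a specific controller's API.

```python
# Minimal work-conserving round-robin arbiter: per-requestor queues,
# grant rotates starting just after the last granted requestor.
from collections import deque

class RoundRobinArbiter:
    def __init__(self, n_requestors):
        self.queues = [deque() for _ in range(n_requestors)]
        self.last = -1                      # index of last granted requestor

    def submit(self, requestor, request):
        self.queues[requestor].append(request)

    def grant(self):
        n = len(self.queues)
        for i in range(1, n + 1):           # scan starting after `last`
            r = (self.last + i) % n
            if self.queues[r]:
                self.last = r
                return r, self.queues[r].popleft()
        return None                         # no pending requests
```

A priority-based arbiter would instead always scan in a fixed order, which gives critical requestors low latency but needs an extra mechanism (e.g. rate limiting) to avoid starving the rest.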

Back-end
- The back-end contains a memory map and a command generator
- The memory map decodes a logical address into a physical address
  - A physical address is a (bank, row, column) triple
  - This can be done in different ways; the choice affects efficiency
  - Example: logical address 0x10FF00 maps to physical address (2, 510, 128)
- The command generator schedules commands for the target memory
  - Customized for a particular memory generation
  - Programmable to handle different timing constraints
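One possible memory map is a simple bit-slice of the logical address into row | bank | column fields. The field widths below (13-bit row, 3-bit bank, 10-bit column) are assumptions for illustration; the slide's 0x10FF00 -> (2, 510, 128) example uses its own, unspecified interleaving.

```python
# Decode a logical address into (bank, row, column) by bit-slicing.
# Field widths are assumed: 10 column bits, 3 bank bits, remainder is the row.
def decode(addr, col_bits=10, bank_bits=3):
    col = addr & ((1 << col_bits) - 1)
    bank = (addr >> col_bits) & ((1 << bank_bits) - 1)
    row = addr >> (col_bits + bank_bits)
    return bank, row, col
```

Placing the bank bits just above the column bits (as here) makes consecutive large requests cross banks, which helps bank parallelism; placing them higher keeps a requestor in one bank, which helps row locality. That trade-off is exactly why the slide says the choice affects efficiency.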

Programmable Memory Timings

Programmable Memory Timings
The SDRAM DIMM generally knows which timings it likes:

Memory map

Command generator
- Generates and schedules commands for scheduled requests
  - May work with both requests and commands
- There are many ways to determine which request to process:
  - Increase bank efficiency: prefer requests targeting open rows
  - Increase read/write efficiency: prefer read after read and write after write
  - Reduce processor stall cycles: always prefer reads, since reads are blocking while writes are often posted
- What are the pros and cons of these methods? What happens to the worst-case?

Command generator
- Generates SDRAM commands without violating timing constraints
- Often built hierarchically: distribute requests across banks and issue commands once timing constraints are satisfied, then choose which command for which bank is executed
- Many possible policies determine which command to schedule
  - Page policies
    - Close rows as soon as possible, to activate a new one faster
    - Keep rows open as long as possible, to benefit from locality
  - Command priorities
    - Read and write commands have high priority, as they put data on the bus
    - Precharge and activate commands have lower priority
- Algorithms often try to put data on the bus as soon as possible
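The two page policies above can be contrasted with a tiny sketch: the close-page policy precharges immediately after each burst, while the open-page policy leaves the row open and pays the precharge only on a row miss. The function and its return convention are illustrative, ignoring all timing.

```python
# Commands needed for one read to `target_row`, given the bank's currently
# open row (None if the bank is precharged). Returns (commands, open row
# after the access). Purely illustrative; timing is ignored.
def commands_for_access(open_row, target_row, policy="open"):
    cmds = []
    if open_row is None:
        cmds.append("ACT")                  # bank closed: just activate
    elif open_row != target_row:
        cmds += ["PRE", "ACT"]              # row miss: close, then activate
    cmds.append("RD")
    if policy == "close":
        cmds.append("PRE")                  # close-page: precharge right away
        return cmds, None
    return cmds, target_row                 # open-page: leave the row open
```

Under the open-page policy a row hit needs only RD, so locality-heavy traffic wins; under the close-page policy every access pays ACT but never PRE-on-miss, which gives a tighter worst case.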

Presentation Outline
- Introduction to SDRAM
- Basic SDRAM operation
- Memory efficiency
- SDRAM controller architecture
- Conclusions

Conclusions
- SDRAM is used as shared off-chip high-volume storage
  - Cheaper but slower than SRAM
- The worst-case efficiency of SDRAM depends on many factors
  [Bar chart: worst-case efficiency for 128MB_DDR2-400 through 256MB_LPDDR2-1066-S4]
- The actual efficiency is highly variable and depends on the application
- The controller tries to minimize latency and maximize efficiency
  - Low latency for critical requestors using priorities
  - Fairness among multiple processors
  - High efficiency by reordering requests to fit the memory state
- The memory map impacts efficiency and power

Presentation Outline (part 2)
- Mixed time-criticality

Trends in embedded systems
- Embedded systems get increasingly complex
  - Increasingly complex applications (more functionality)
  - A growing number of applications integrated in a device
  - More applications execute concurrently
  - This requires increased system performance without increasing power
- The resulting contemporary platforms are heterogeneous multi-processor systems with a distributed memory hierarchy, to improve the performance/power ratio
- Resources in the system are shared to reduce cost

Mixed time-criticality
- Applications have mixed time-criticality
- Firm real-time requirements (FRT)
  - E.g. a software-defined radio application
  - Failure to satisfy a requirement may violate correctness
  - No deadline misses are tolerable
- Soft real-time requirements (SRT)
  - E.g. a media decoder application
  - Failure to satisfy a requirement reduces output quality
  - Occasional deadline misses are tolerable
- No real-time requirements (NRT)
  - E.g. a graphical user interface
  - No timing requirements, but must be responsive

Formal verification
- Verifying mixed real-time (MRT) systems requires a combination of methods:
  - Formal verification
  - Simulation-based verification
- Formal verification is often used to verify FRT requirements
  - Provides analytical bounds on response time or throughput
  - Considers all application inputs
  - Covers all combinations of concurrently running applications
- The approach requires models of both applications and hardware
  - Application models are not always available
  - The behavior of dynamic applications is not captured accurately
  - Most hardware is not designed with formal analysis in mind

Simulation-based verification
- Simulation is typically used to verify SRT and NRT applications
  - The system is simulated with a large set of inputs
- Resource sharing results in interference between applications
  - The timing behaviors of applications in a use-case are inter-dependent
  - All use-cases must be verified, instead of all applications
  - Verification must be repeated if applications are added or modified
- Verification by simulation is a slow process with poor coverage
- Verification is costly, and the effort is expected to increase in the future!

Performance guarantees for SDRAM
- SDRAM memories are particularly challenging resources
  - The execution time of a request in an SDRAM is variable
  - The WCET is pessimistic and the guaranteed bandwidth is very low
  - Less than 16% of bandwidth can be guaranteed for all DDR3 devices
  [Bar chart: guaranteed bandwidth fraction for 128MB_DDR2-400 through 256MB_LPDDR2-1066-S4]
- SDRAM bandwidth is scarce and must be efficiently utilized
  - Additional interfaces cannot be added due to cost constraints

Problem statement
- Complex systems have mixed time-criticality
  - Firm, soft, and no real-time requirements in one system
  - We refer to this as mixed real-time (MRT) requirements
- Sharing an SDRAM controller between FRT and SRT/NRT applications is challenging
- We would like to use the SDRAM in an efficient and power-conscious manner
  - Satisfying the FRT requirements, while providing sufficient performance to the SRT/NRT applications