An introduction to SDRAM and memory controllers 5kk73
Presentation Outline (part 1) Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions Followed by part 2 2
Static RAM (SRAM) SRAM is typically on-chip memory Found in higher levels of the memory hierarchy Caches and scratchpads Either local to a processor or centralized Local memory has very short access time Centralized shared memories have intermediate access time An SRAM cell consists of six transistors This limits memory capacity to a few megabytes or less 3
Dynamic RAM (DRAM) DRAM was patented in 1968 by Robert Dennard at IBM Significantly cheaper than SRAM (power / area) 1 transistor and 1 capacitor vs. 6 transistors for SRAM A bit is represented by a high or low charge on the capacitor Charge dissipates due to leakage hence the term dynamic RAM Capacity of up to a gigabyte per chip DRAM is (shared) off-chip memory Long access time compared to SRAM Off-chip pins are expensive in terms of area and power SDRAM bandwidth is scarce and must be efficiently utilized Found in lower levels of memory hierarchy Used as remote high-volume storage 5
The DRAM evolution Evolution of the DRAM design in the past 16 years A clock signal was added, making the design synchronous (SDRAM) The data bus transfers data on both the rising and falling edges of the clock (hence Double Data Rate) The second and third generations of DDR memory (DDR2/DDR3) scale to higher clock frequencies (up to 1066 MHz) DDR4 is now standardized by JEDEC (up to 1200 MHz, 1600 MHz is planned) Special branches of DDR memories exist for graphics cards (GDDR) and for low-power systems (LPDDR, LPDDR2, LPDDR3) 6
Comparing the DRAM Evolution to the Dennard Evolution 7
SDRAM Architecture The SDRAM architecture is organized in banks, rows and columns A row buffer stores the currently active (open) row The memory interface has a command bus, address bus, and a data bus Buses are shared between all banks to reduce the number of off-chip pins A bank is essentially an independent memory, but with shared I/O Typical values for DDR2/DDR3: 4 or 8 banks; 8K-65K rows / bank; 1K-2K columns / row; 4, 8, or 16 bits / column; 200-800 MHz; 32 MB - 1 GB density Example memory: 64 MB 16-bit DDR3-1600 with 8 banks, 8K rows / bank, 1024 columns / row, 16 bits / column, 3200 MB/s peak bandwidth (Figure: a bank with its row buffer; activate opens a row, read/write access it, precharge closes it) 8
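The example's 3200 MB/s peak bandwidth follows directly from the transfer rate and the interface width: a DDR3-1600 part performs 1600 million transfers per second on a 16-bit bus. A minimal sketch (the function name is ours, not from the slides):

```python
def peak_bandwidth_mb_s(transfer_rate_mt_s: int, interface_bits: int) -> int:
    """Peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return transfer_rate_mt_s * (interface_bits // 8)

# 16-bit DDR3-1600: 1600 MT/s * 2 bytes per transfer = 3200 MB/s
print(peak_bandwidth_mb_s(1600, 16))  # 3200
```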
An Analogy An 8-bank SDRAM is like an 8-lane highway with a dead end and a single exit Shared by 2-way traffic! 9
Presentation Outline Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions 10
Basic SDRAM Operation Requested row is activated and copied into the row buffer of the bank Read bursts and/or write bursts are issued to the active row Programmed burst length (BL) of 4 or 8 words Row is precharged and stored back into the memory array Commands: Activate (ACT) activates a row in a particular bank; Read (RD) initiates a read burst to an active row; Write (WR) initiates a write burst to an active row; Precharge (PRE) closes a row in a particular bank; Refresh (REF) starts a refresh operation; No operation (NOP) ignores all inputs 11
Timing Constraints Timing constraints determine when commands can be scheduled More than 20 constraints, some are inter-dependent Limits the efficiency of memory accesses Wait for precharge, activate and read/write commands before data appears on the bus Timing constraints get increasingly severe for faster memories The physical design of the memory core has not changed much Constraints in nanoseconds stay constant, but the clock period gets shorter Parameters (in cycles): ACT to RD/WR (trcd) = 3; ACT to ACT, diff. banks (trrd) = 2; ACT to ACT, same bank (tras) = 12; read latency (trl) = 3; RD to RD = BL/2 12 December 8, 2014
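The last point can be made concrete: a constraint fixed in nanoseconds costs more cycles as the clock speeds up. A sketch, assuming a 15 ns constraint purely for illustration (DDR2-400 and DDR3-1600 have 200 MHz and 800 MHz command clocks, respectively):

```python
import math

def cycles(t_ns: float, clock_mhz: float) -> int:
    """Clock cycles needed to cover a constraint of t_ns nanoseconds."""
    period_ns = 1000.0 / clock_mhz
    return math.ceil(t_ns / period_ns)

# The same 15 ns constraint (illustrative value) on two command clocks:
print(cycles(15, 200))  # DDR2-400:  3 cycles
print(cycles(15, 800))  # DDR3-1600: 12 cycles
```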
Bank Parallelism Multiple banks provide parallelism SDRAM has separate data and command buses Activate, precharge and transfer data in parallel (bank preparation) Increases efficiency Figure shows parallel memory bursts with burst length 8 13
Presentation Outline Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions 14
Memory Efficiency Memory efficiency is the fraction of clock cycles with data transfer Defines the exchange rate between peak bandwidth and net bandwidth Net bandwidth is the actual useful bandwidth after considering overhead Five categories of memory efficiency for SDRAM: Refresh efficiency Read/write efficiency Bank efficiency Command efficiency Data efficiency Memory efficiency is the product of these five categories 15
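Since the overall figure is the product of the five categories, a single weak category caps the net bandwidth. A sketch with invented example values (the slide defines only the structure, not these numbers):

```python
def memory_efficiency(refresh, rw, bank, command, data):
    """Overall memory efficiency is the product of the five categories."""
    return refresh * rw * bank * command * data

def net_bandwidth(peak_mb_s, efficiency):
    """Net bandwidth is peak bandwidth scaled by memory efficiency."""
    return peak_mb_s * efficiency

# Invented example values: even with four healthy categories, a 0.70
# bank efficiency drags the product below 40%.
e = memory_efficiency(0.97, 0.80, 0.70, 0.95, 0.75)
print(round(e, 3))                   # overall efficiency
print(round(net_bandwidth(3200, e)))  # net MB/s out of a 3200 MB/s peak
```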
Refresh Efficiency SDRAMs need to be refreshed regularly to retain data A DRAM cell contains a leaking capacitor A refresh command must be issued every 7.8 μs for DDR2/DDR3/DDR4 All banks must be precharged Data cannot be transferred during refresh Refresh efficiency is largely independent of traffic, generally 95-99% 16
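The 95-99% range can be reproduced with a simple model: refresh blocks the memory for tRFC out of every tREFI = 7.8 μs interval. Only tREFI comes from the slide; the tRFC value below is an assumed illustrative number (tRFC grows with device density):

```python
def refresh_efficiency(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Fraction of time not blocked by refresh: 1 - tRFC / tREFI."""
    return 1.0 - t_rfc_ns / t_refi_ns

# tREFI = 7.8 us (slide); tRFC = 127.5 ns is an assumed example value.
print(refresh_efficiency(127.5, 7800.0))  # ~0.98, inside the 95-99% range
```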
Read / Write Efficiency The data bus of an SDRAM is bi-directional Cycles are lost when switching direction of the data bus Extra NOPs must be inserted between read and write commands I should have inserted more NOPs! Read/write efficiency depends on traffic Determined by frequency of read/write switches Switching too often has a significant impact on memory efficiency Switching after every burst of 8 words gives 57% r/w efficiency with DDR2-400 How would you address this if you designed a memory controller? 17
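One way to see the 57% figure is a small cost model: each burst of 8 words occupies BL/2 = 4 data cycles, and every direction switch loses some cycles. The 3-cycle switch penalty below is chosen so that switching after every burst lands near the slide's 57%; it is an illustrative assumption, not a DDR2-400 datasheet value:

```python
def rw_efficiency(data_cycles_per_burst, bursts_between_switches, switch_penalty):
    """Data cycles divided by data cycles plus direction-switch overhead."""
    data = data_cycles_per_burst * bursts_between_switches
    return data / (data + switch_penalty)

# BL=8 => 4 data cycles per burst; 3 lost cycles per switch (assumed):
print(rw_efficiency(4, 1, 3))  # switch after every burst: 4/7 ~ 0.57
print(rw_efficiency(4, 8, 3))  # group 8 same-direction bursts: 32/35 ~ 0.91
```

Grouping reads and writes, as the second call shows, is the natural controller-side answer to the question on the slide.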
Bank Efficiency Bank conflict when a read or write targets an inactive row (row miss) Significantly impacts memory efficiency Requires precharge followed by activate Less than 40% bank efficiency if always row miss in same bank Bank efficiency depends on traffic Determined by address of request and memory map How would you address this if you designed a memory controller? 18
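The below-40% figure follows from the same style of model, using the cycle counts from the timing-constraint table earlier in the deck: if every access misses in the same bank, at most one burst of 4 data cycles fits in each 12-cycle same-bank ACT-to-ACT window. A sketch (the function name is ours):

```python
def bank_efficiency_all_miss(data_cycles: int, act_to_act_same_bank: int) -> float:
    """Every access reopens a row in the same bank: one burst per window."""
    return data_cycles / act_to_act_same_bank

# BL=8 => 4 data cycles; 12-cycle same-bank ACT-to-ACT from the earlier table:
print(bank_efficiency_all_miss(4, 12))  # ~0.33, below the 40% bound
```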
Command Efficiency Command bus uses single data rate Congested if two commands are required simultaneously One command has to be delayed may delay data on bus Command efficiency depends on traffic Small bursts reduce command efficiency Potentially more activate and precharge commands issued Generally quite high (95-100%) 19
Be a Memory Command Scheduler! (Worst Halloween costume ever) Pen and paper may be useful For this SDRAM (loosely based on an LPDDR2-667 memory), in cycles: ACT to RD/WR, same bank (trcd) = 6; ACT to ACT, diff. banks (trrd) = 5; ACT to ACT, same bank (trc) = 20; RD to RD or WR to WR = 4; RD to WR = 8; WR to RD = 8; RD to PRE, same bank = 5; WR to PRE, same bank... = 12; PRE to ACT, same bank (trp) = 5 Schedule these bursts: a) RD to bank 0, row 0, column 8 b) RD to bank 0, row 1, column 16 c) WR to bank 0, row 0, column 0 d) WR to bank 3, row 0, column 8 e) WR to bank 0, row 1, column 32 f) RD to bank 0, row 0, column 0 You can reorder all you like Optimize for the total schedule length 20
A possible solution Schedule these bursts: a) RD to bank 0, row 0, column 8 b) RD to bank 0, row 1, column 16 c) WR to bank 0, row 0, column 0 d) WR to bank 3, row 0, column 8 e) WR to bank 0, row 1, column 32 f) RD to bank 0, row 0, column 0 (Timing parameters as on the previous slide) (Timing diagram: one possible schedule of roughly 51 cycles. ACT to bank 0 and ACT to bank 3 are pipelined, followed by the writes to row 0 (c and d); then the reads to row 0 (f and a); bank 0 is precharged and row 1 activated; finally the write to column 32 (e) and the read to column 16 (b))
A possible solution Useful techniques: Grouping reads and writes Pipelining operations to different banks Overlapping timing constraints (waiting for both the WR-to-RD and the WR-to-PRE constraint at the same time, for example) (Reordering is not always an option in reality) These concepts are revisited later in this lecture Normally we build hardware to schedule these commands 22
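The bookkeeping such hardware performs can be sketched in a few lines. The toy issuer below takes an already-chosen command order and computes the earliest issue cycle of each command under a small subset of the exercise's constraints (ACT-to-ACT, ACT-to-RD/WR, RD/WR-to-RD/WR); real back-ends track all constraints, per-bank state, and choose the order themselves:

```python
# Cycle counts from the exercise table (subset only):
ACT_TO_RW, ACT_TO_ACT_DIFF, ACT_TO_ACT_SAME, RW_TO_RW = 6, 5, 20, 4

def schedule(commands):
    """commands: list of (cmd, bank) issued in order; cmd is 'ACT', 'RD' or 'WR'.
    Returns (cycle, cmd, bank) triples. Assumes each bank is activated
    before it is read or written."""
    now = 0
    last_act = {}          # bank -> cycle of its last ACT
    last_any_act = -10**9  # cycle of the last ACT to any bank
    last_rw = -10**9       # cycle of the last RD/WR
    out = []
    for cmd, bank in commands:
        t = now
        if cmd == "ACT":
            t = max(t, last_any_act + ACT_TO_ACT_DIFF,
                    last_act.get(bank, -10**9) + ACT_TO_ACT_SAME)
            last_act[bank] = t
            last_any_act = t
        else:  # RD or WR
            t = max(t, last_act[bank] + ACT_TO_RW, last_rw + RW_TO_RW)
            last_rw = t
        out.append((t, cmd, bank))
        now = t
    return out

# Pipelining two banks, as in the solution's opening moves (bursts c and d):
print(schedule([("ACT", 0), ("ACT", 3), ("WR", 0), ("WR", 3)]))
```

The two activates overlap (5 cycles apart, not 20), so the write to bank 3 issues only a few cycles after the write to bank 0: this is the bank-preparation effect from the bank-parallelism slide.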
Data Efficiency A memory burst cannot access segments smaller than the minimum burst length Minimum access granularity A burst length of 8 words is 16 B with a 16-bit memory and 64 B with a 64-bit memory Excess data is thrown away! If data is poorly aligned, an extra segment has to be transferred Cycles are lost when transferring unrequested data Data efficiency depends on the memory client (and the application) Smaller requests reduce data efficiency 23
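Data efficiency can be computed from the request size, its alignment, and the minimum access granularity. A sketch (names are ours):

```python
def data_efficiency(request_bytes, offset, granularity):
    """Requested bytes over bytes actually transferred in whole segments."""
    first = offset - offset % granularity        # start of first segment
    last = offset + request_bytes                # one past the last byte
    segments = -(-(last - first) // granularity)  # ceiling division
    return request_bytes / (segments * granularity)

# 16 B granularity (BL=8 on a 16-bit memory, as on the slide):
print(data_efficiency(16, 0, 16))  # aligned 16 B request: 1.0
print(data_efficiency(16, 8, 16))  # misaligned: spills into a 2nd segment, 0.5
print(data_efficiency(4, 0, 16))   # small request: 0.25
```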
Conclusions on Memory Efficiency Memory efficiency is highly dependent on traffic Worst-case efficiency is very low Every burst targets a different row in the same bank Read/write switch after every burst This results in less than 31% efficiency for all DDR2/DDR3/LPDDR/LPDDR2 memories Efficiency drops as memories become faster (DDR4) 24
Worst-case memory efficiency (Chart: worst-case efficiency for 128MB_DDR2-400, 128MB_DDR2-800, 128MB_DDR3-800, 128MB_DDR3-1600, 128MB_LPDDR-266, 128MB_LPDDR-400, 256MB_LPDDR-416, 256MB_LPDDR2-667-S4, 256MB_LPDDR2-800-S4, 256MB_LPDDR2-1066-S4) DDRX-Y runs at Y/2 MHz command rate Transports 2 memory words per cycle Conclusion: worst-case efficiency must be avoided! (And what is wrong with this picture?) 25
Presentation Outline Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions 26
A general memory controller architecture A general controller architecture consists of two parts The front-end: buffers requests and responses per requestor; schedules one (or more) requests for memory access; is independent of the memory type The back-end: translates scheduled requests into an SDRAM command sequence; is dependent on the memory type 27
Front-end arbitration Front-end provides buffering and arbitration Arbiter can schedule requests in many different ways Priorities are common to give low-latency access to critical requestors E.g. stalling processor waiting for a cache line However, it is important to prevent starvation of low priority requestors Common to schedule fairly in case of multiple processors (round-robin, TDM) Next request may be scheduled before previous is finished Gives more options to command generator in back-end Scheduled requests are sent to the back-end for memory access 28
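A minimal front-end arbiter sketch combining the two ideas above: fixed priority for one critical requestor, round-robin among the rest. Requestor names are invented, and an unbounded priority like this is exactly what can starve the low-priority requestors:

```python
class Arbiter:
    """Priority for one critical requestor, round-robin for the others."""
    def __init__(self, requestors, critical):
        self.critical = critical
        self.others = [r for r in requestors if r != critical]
        self.rr = 0  # round-robin pointer

    def schedule(self, pending):
        """pending: set of requestors that have a waiting request."""
        if self.critical in pending:  # low-latency path; may starve others
            return self.critical
        for i in range(len(self.others)):
            cand = self.others[(self.rr + i) % len(self.others)]
            if cand in pending:
                self.rr = (self.rr + i + 1) % len(self.others)
                return cand
        return None

arb = Arbiter(["cpu", "dsp", "dma"], critical="cpu")
print(arb.schedule({"cpu", "dma"}))  # cpu wins on priority
print(arb.schedule({"dsp", "dma"}))  # round-robin: dsp
print(arb.schedule({"dsp", "dma"}))  # then dma
```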
Back-end Back-end contains a memory map and a command generator Memory map decodes logical addresses to physical addresses A physical address is (bank, row, column) Can be done in different ways; the choice affects efficiency Example: logical addr. 0x10FF00 -> memory map -> physical addr. (2, 510, 128) Command generator schedules commands for the target memory Customized for a particular memory generation Programmable to handle different timing constraints 29
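A memory map is often a simple bit slicing of the logical address. The sketch below uses hypothetical field widths matching the example memory (8 banks, 8K rows, 1024 columns); it does not reproduce the slide's 0x10FF00 -> (2, 510, 128) example, which depends on the controller's actual map:

```python
BANK_BITS, ROW_BITS, COL_BITS = 3, 13, 10  # 8 banks, 8K rows, 1K columns

def decode(addr):
    """Slice a logical word address into (bank, row, column)."""
    column = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return bank, row, column

# Placing the bank bits just above the column bits makes consecutive
# row-sized chunks land in different banks, enabling bank parallelism.
print(decode(0x10FF00))
```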
Programmable Memory Timings 30
Programmable Memory Timings The SDRAM DIMM generally knows which timings it likes: 31
Memory map 32
Command generator Generates and schedules commands for scheduled requests May work with both requests and commands Many ways to determine which request to process Increase bank efficiency: prefer requests targeting open rows Increase read/write efficiency: prefer read after read and write after write Reduce stall cycles of the processor: always prefer reads, since reads are blocking and writes are often posted What are the pros and cons of these methods? What happens to the worst-case? 36
Command generator Generate SDRAM commands without violating timing constraints Often built hierarchically: distribute requests across banks, and issue commands once timing constraints are satisfied; then choose which command for which bank is executed Many possible policies to determine which command to schedule Page policies: close rows as soon as possible to activate a new one faster, or keep rows open as long as possible to benefit from locality Command priorities: read and write commands have high priority, as they put data on the bus; precharge and activate commands have lower priority Algorithms often try to put data on the bus as soon as possible 37
Presentation Outline Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions 38
Conclusions SDRAM is used as shared off-chip high-volume storage Cheaper but slower than SRAM The worst-case efficiency of SDRAM depends on many factors (chart repeated from the worst-case efficiency slide) The actual case is highly variable and depends on the application The controller tries to minimize latency and maximize efficiency Low latency for critical requestors using priorities Fairness among multiple processors High efficiency by reordering requests to fit the memory state The memory map impacts efficiency and power 39
Presentation Outline (part 2) Mixed time-criticality 41
Trends in embedded systems Embedded systems get increasingly complex Increasingly complex applications (more functionality) Growing number of applications integrated in a device More applications execute concurrently Requires increased system performance without increasing power The resulting complex contemporary platforms are heterogeneous multi-processor systems with distributed memory hierarchy to improve performance/power ratio Resources in the system are shared to reduce cost 42
Mixed time-criticality Applications have mixed time-criticality Firm real-time requirements (FRT) E.g. a software-defined radio application Failure to satisfy the requirement may violate correctness No deadline misses tolerable Soft real-time requirements (SRT) E.g. a media decoder application Failure to satisfy the requirement reduces quality of output Occasional deadline misses tolerable No real-time requirements (NRT) E.g. a graphical user interface No timing requirements, but must be responsive 43
Formal verification Verifying MRT systems requires a combination of methods Formal verification Simulation-based verification Formal verification is often used to verify FRT requirements Provides analytical bounds on response time or throughput Considers all application inputs Covers all combinations of concurrently running applications Approach requires models of both applications and hardware Application models are not always available Behavior of dynamic applications is not captured accurately Most hardware is not designed with formal analysis in mind 44
Simulation-based verification Simulation is typically used to verify SRT and NRT applications The system is simulated with a large set of inputs Resource sharing results in interference between applications The timing behaviors of applications in a use-case are inter-dependent All use-cases must be verified instead of all applications Verification must be repeated if applications are added or modified Verification by simulation is a slow process with poor coverage Verification is costly and the effort is expected to increase in the future! 45
Performance guarantees for SDRAM SDRAM memories are particularly challenging resources The execution time of a request in an SDRAM is variable WCET is pessimistic and guaranteed bandwidth is very low Less than 16% bandwidth can be guaranteed for all DDR3 devices (Chart: guaranteed bandwidth for the same set of DDR2/DDR3/LPDDR/LPDDR2 devices as before) SDRAM bandwidth is scarce and must be efficiently utilized Additional interfaces cannot be added due to cost constraints 46
Problem statement Complex systems have mixed time-criticality Firm, soft, and no real-time requirements in one system We refer to this as mixed real-time (MRT) requirements Sharing an SDRAM controller between FRT and SRT/NRT applications is challenging We would like to use the SDRAM in an efficient and power-conscious manner Satisfying the FRT requirements, while providing sufficient performance to the SRT/NRT applications 47