DDR4: Designing for Power and Performance
Agenda
- Comparison between DDR3 and DDR4
- Designing for power
- DDR4 power savings
- Designing for performance
- Creating a data valid window
- Good layout practices for DDR4
- Board debug tools to minimize issues
- Looking ahead and conclusion
Comparison Between DDR3 and DDR4
DRAM Technology Comparison

| Parameter | DDR3 | DDR4 | GDDR5 |
|---|---|---|---|
| Voltage | 1.5 V / 1.35 V | 1.2 V | 1.5 V / 1.35 V |
| Strobe | Bi-directional differential | Bi-directional differential | Free-running differential WRITE clock |
| Strobe configuration | Per byte | Per byte | Per word |
| READ data capture | Strobe based | Strobe based | Clock data recovery |
| Data termination | VDDQ/2 | VDDQ | VDDQ |
| Address/command termination | VDDQ/2 | VDDQ/2 | VDDQ |
| Burst length | BC4, 8 | BC4, 8 | 8 |
| Bank grouping | No | 4 | 4 |
| On-chip error detection | No | Command/address parity; CRC for data bus | CRC for data bus |
| Configuration | x4, x8, x16 | x4, x8, x16 | x16, x32 |
| Package | 78-ball / 96-ball FBGA | 78-ball / 96-ball FBGA | 170-ball FBGA |
| Data rate (Mbps/pin) | 800 to 2,133 | 1,600 to 3,200+ | 4,000 to 7,000 |
| Component density | 1 Gb to 8 Gb | 2 Gb to 16 Gb | 512 Mb to 2 Gb |
| Stacking options | DDP, QDP | Up to 8H (128-Gb stack); single load | No |
DDR4 Power Savings
DDR4 Power Savings Features
- DDR4 voltage is 1.2 V (up to 40% savings)
  - Lower voltage than DDR3 (1.5 V)
  - On-die VREF
  - Pseudo-open-drain I/Os
- Manages refreshes based on temperature (up to 20% savings)
  - New DDR4 low-power auto self-refresh (LPASR) capability
  - Changes the refresh rate based on temperature
  - Refreshes only the parts of the array that are in use
  - Controller must allow fine-granularity refresh based on memory utilization
- Supports data bus inversion
  - Limits the number of signals transitioning, reducing simultaneous switching output (SSO) and saving power
Creating a Data Valid Window
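The data bus inversion feature listed on this slide can be illustrated with a minimal Python sketch. The slide does not state the inversion rule; the threshold used here (invert a byte lane when more than four of its bits are 0, and signal that by driving DBI_n low) is an assumption based on the usual description of DDR4 DBI with pseudo-open-drain I/Os, where a driven 0 is the state that draws current. The function names are hypothetical.

```python
def dbi_encode(byte):
    """Encode one 8-bit lane. Returns (dq_byte, dbi_n).

    Assumed rule: if more than four bits are 0, send the inverted byte
    and drive DBI_n low (0); otherwise send the byte as-is with DBI_n high (1).
    """
    byte &= 0xFF
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:
        return (~byte) & 0xFF, 0
    return byte, 1


def dbi_decode(dq_byte, dbi_n):
    """Recover the original byte from the lane value and the DBI_n flag."""
    dq_byte &= 0xFF
    return (~dq_byte) & 0xFF if dbi_n == 0 else dq_byte


if __name__ == "__main__":
    for value in (0x00, 0x0F, 0x01, 0xFF):
        dq, flag = dbi_encode(value)
        print(f"data=0x{value:02X} -> DQ=0x{dq:02X}, DBI_n={flag}")
```

Under this assumed rule, no encoded byte ever carries more than four 0 bits, which bounds the number of lines driven low in any one transfer.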
Timing Margins Are Shrinking

Shrinking Timing Margins in Picoseconds

| Generation | Data Valid Window | DRAM Margin | Package/Board Margin | Chip Margin |
|---|---|---|---|---|
| DDR1 | 2,500 | 900 | 800 | 800 |
| DDR2 | 938 | 425 | 256 | 256 |
| DDR3 | 469 | 188 | 140 | 140 |
| DDR4 | 313 | 125 | 93 | 93 |

[Figure: bar chart of the data valid window (2,500, 938, 469, and 313 ps) for DDR1 through DDR4, spanning data rates from 400 Mbps to 3,200 Mbps.]
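As a rough cross-check of the numbers on this slide, the following Python sketch converts a per-pin data rate into a unit interval and compares each data valid window with the sum of its three margins. The relation UI = 1 / data rate is standard; the dictionary simply restates the slide's table, and the function name is hypothetical.

```python
def unit_interval_ps(rate_mbps):
    """Unit interval in picoseconds for a per-pin data rate in Mbps."""
    return 1e6 / rate_mbps


# (data valid window, DRAM margin, package/board margin, chip margin) in ps,
# copied from the table on this slide.
BUDGET_PS = {
    "DDR1": (2500, 900, 800, 800),
    "DDR2": (938, 425, 256, 256),
    "DDR3": (469, 188, 140, 140),
    "DDR4": (313, 125, 93, 93),
}

if __name__ == "__main__":
    print("UI at 400 Mbps:", unit_interval_ps(400), "ps")
    print("UI at 3,200 Mbps:", unit_interval_ps(3200), "ps")
    for name, (window, dram, board, chip) in BUDGET_PS.items():
        total = dram + board + chip
        print(f"{name}: window={window} ps, sum of margins={total} ps")
```

The margins sum to within about 2 ps of each window, which suggests the table splits the unit interval among the DRAM, the package/board, and the controller chip.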
Shrinking the Window Even More: DDR4 VREF Training (1/2)
- DDR4 VREF training
  - Training: sweep the VREF setting and find the maximum passing window
  - Lump sum of DCD, RX offset, etc.
  - Resolution error is the combination of (VREF, PI, or delay chain)
- Margin loss calculation
  - VREF step size: from 0.5% VDDQ to 0.8% VDDQ
  - VREF set tolerance: 1.625% or 0.15%
  - Calibration error: 1 step size, 0.8% × VDDQ = 0.8% × 1.2 V = 9.6 mV
  - Margin loss due to VREF calibration error: 9.6 mV × 2 / slew rate = 4.8 ps (assuming a slew rate of 4 V/ns)
- Figure annotation: calibration error = half step size

| Parameter | Symbol | Min | Typ | Max | Unit | Notes |
|---|---|---|---|---|---|---|
| VREF step size | Vref step | 0.50% | 0.65% | 0.80% | VDDQ | 2 |
| VREF set tolerance | Vref_set_tol | -1.625% | 0.00% | 1.625% | VDDQ | 3, 4, 6 |
| | | -0.15% | 0.00% | 0.15% | VDDQ | 3, 5, 7 |
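The margin loss arithmetic on this slide can be reproduced with a short Python sketch. It uses only the slide's own numbers (step size as a percentage of VDDQ, VDDQ = 1.2 V, slew rate = 4 V/ns) and its formula of twice the voltage error divided by the slew rate; the function name is hypothetical.

```python
def vref_margin_loss_ps(step_pct_vddq, vddq_v, slew_v_per_ns):
    """Timing margin lost to a VREF calibration error of one step.

    Voltage error (mV) = step% * VDDQ; loss = 2 * error / slew rate.
    mV divided by V/ns yields picoseconds directly.
    """
    error_mv = step_pct_vddq / 100.0 * vddq_v * 1000.0
    return 2.0 * error_mv / slew_v_per_ns


if __name__ == "__main__":
    for step in (0.5, 0.65, 0.8):
        loss = vref_margin_loss_ps(step, vddq_v=1.2, slew_v_per_ns=4.0)
        print(f"step={step}% VDDQ -> margin loss={loss:.2f} ps")
```

With the worst-case 0.8% step this gives the 4.8 ps quoted on the slide; a slower edge (lower slew rate) makes the same voltage error cost more time.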
Shrinking the Window Even More: DDR4 VREF Training (2/2)
- Discussion with JEDEC members
  - DDR4 specification section 13.4: any DRAM component-level variation must be accounted for within the DRAM RX mask. This means that the VREF calibration error is included in VdIVW_total.
  - Internal VREF_DQ is aligned to VCENT_DQ by training.
  - VCENT_DQ has variation. The VREF_DQ training error should increase with this variation, with internal voltage noise, and so on.
Shrinking the Window Even More: Duty Cycle Error
- DDR4 specification is +/-2% tCK = +/-0.04 UI
- IPD current budget: +/-3% tCK
- Margin loss is 4% tCK
- With proper link timing calibration: 2% tCK margin loss
- Assume the same for read

[Figure: +/-2% duty cycle error shown on both DQS and DQ.]

Timing Parameters by Speed Bin for DDR4-2400 to DDR4-3200: Clock Timing

| Parameter | Symbol | DDR4-2400 MIN | DDR4-2400 MAX | DDR4-2666 MIN | DDR4-2666 MAX | DDR4-3200 MIN | DDR4-3200 MAX | Units | Note |
|---|---|---|---|---|---|---|---|---|---|
| Minimum clock cycle time (DLL-off mode) | tCK (DLL_OFF) | 8 | - | 8 | - | 8 | - | ns | 22 |
| Average clock period | tCK (avg) | TBD | TBD | TBD | TBD | TBD | TBD | ps | |
| Average high pulse width | tCH (avg) | 0.48 | 0.52 | 0.48 | 0.52 | 0.48 | 0.52 | tCK (avg) | |
| Average low pulse width | tCL (avg) | 0.48 | 0.52 | 0.48 | 0.52 | 0.48 | 0.52 | tCK (avg) | |
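The conversion from a duty cycle error in percent of tCK to picoseconds and to unit intervals can be sketched in Python as follows. It assumes double data rate signaling, so that one clock period spans two unit intervals; the function name and the choice of DDR4-3200 for the example are illustrative.

```python
def duty_cycle_error(data_rate_mtps, error_pct_tck):
    """Return (error in ps, error in UI) for a duty cycle error given in % of tCK.

    Assumes double data rate: tCK = 2 UI, so tCK (ps) = 2e6 / data rate (MT/s).
    """
    tck_ps = 2e6 / data_rate_mtps
    ui_ps = tck_ps / 2.0
    error_ps = error_pct_tck / 100.0 * tck_ps
    return error_ps, error_ps / ui_ps


if __name__ == "__main__":
    for pct in (2, 3, 4):
        ps, ui = duty_cycle_error(3200, pct)
        print(f"{pct}% tCK at DDR4-3200 = {ps:.2f} ps = {ui:.2f} UI")
```

Because tCK is two unit intervals, any percentage of tCK doubles when expressed as a fraction of the UI, which is why +/-2% tCK appears as +/-0.04 UI on the slide.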
Shrinking the Window Even More: Calculating the PLL Jitter
- Current profile: I(f)
- PDN impedance: Z(f)
- Jitter sensitivity: S(f)
- PSRR of PLL: P(f)
- Jitter spectrum: $J(f) = I(f)\,Z(f)\,S(f)\,P(f)$
- TIE jitter: $j(t) = \mathrm{IFFT}\{J(f)\}$, from which the peak-to-peak jitter is read
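The flow on this slide (multiply the four frequency-domain terms, then inverse-transform to get time interval error) can be sketched in pure Python. The spectra in the example are made-up placeholders rather than measured data, the inverse DFT is a naive O(n^2) implementation used in place of a library IFFT, and the sketch assumes the product spectrum is conjugate-symmetric so that j(t) is real.

```python
import cmath


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform (stand-in for an IFFT)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def tie_jitter(current, impedance, sensitivity, psrr):
    """Compute J(f) = I(f) Z(f) S(f) P(f), then j(t) and its peak-to-peak value."""
    jitter_spectrum = [
        i * z * s * p for i, z, s, p in zip(current, impedance, sensitivity, psrr)
    ]
    # Keep the real part; the imaginary part is zero for a conjugate-symmetric J(f).
    samples = [value.real for value in inverse_dft(jitter_spectrum)]
    return samples, max(samples) - min(samples)


if __name__ == "__main__":
    # Placeholder spectra: a single tone in bins 1 and 7 of an 8-point grid.
    i_f = [0, 2, 0, 0, 0, 0, 0, 2]
    z_f = [1] * 8
    s_f = [2] * 8
    p_f = [1] * 8
    j_t, peak_to_peak = tie_jitter(i_f, z_f, s_f, p_f)
    print("j(t):", [round(x, 3) for x in j_t])
    print("peak-to-peak:", round(peak_to_peak, 3))
```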
DDR4 Bank Group Timing
- Different timing within a group and between groups (tCCD, tWTR, tRRD)
  - Long timing: bank-to-bank access within a group
  - Short timing: access to different bank groups
- Maintain array timing requirements within a bank group
- Maintain speed between different bank groups

[Figure: Bank Group 0 through Bank Group 3, each containing Bank 0 through Bank 3; short timings apply between bank groups and long timings apply within a bank group.]
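The effect of long and short timings on back-to-back column commands can be illustrated with a small Python sketch. The cycle counts (4 clocks for the short timing, 6 for the long one) are illustrative assumptions, not values taken from the slide, and the model only checks each command against the one immediately before it, which is sufficient here because the long value does not exceed twice the short value.

```python
TCCD_S = 4  # assumed short column-to-column delay (different bank groups), in clocks
TCCD_L = 6  # assumed long column-to-column delay (same bank group), in clocks


def issue_cycles(bank_groups):
    """Earliest issue cycle for each column command in a stream.

    bank_groups: sequence of bank group indices, one per command, in order.
    """
    cycles = []
    for index, group in enumerate(bank_groups):
        if index == 0:
            cycles.append(0)
            continue
        gap = TCCD_L if group == bank_groups[index - 1] else TCCD_S
        cycles.append(cycles[-1] + gap)
    return cycles


if __name__ == "__main__":
    print("same group:   ", issue_cycles([0, 0, 0, 0]))
    print("interleaved:  ", issue_cycles([0, 1, 0, 1]))
```

In this toy model, interleaving accesses across bank groups finishes the same four commands in fewer clocks than hammering one group, which is the scheduling behavior a controller would want to exploit.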
Calibration Is Critical to Shrinking Margins

[Figure: bar chart of margin (ns) on a scale from -0.1 to 0.5, showing FPGA effects, external effects, calibration effects, and calibration uncertainty. Annotation: no margin without calibration.]
What Is Calibration?
- Capture calibration (de-skew)
  - Before de-skew: small valid capture window
  - After de-skew: maximized valid capture window
  - Benefit: reduces skew within a data group and gives more capture margin
- Resync calibration
  - Benefit: accurate strobe placement and more resync margin
- VT compensation
  - Data shifts due to VT variations
  - Voltage and temperature tracking
  - Benefit: dynamic phase adjustment to match the shifting data valid window; robust over VT

[Figure: DQ0 to DQ7 plotted against DQS phase (0 to 180 degrees in 15-degree steps) before and after de-skew; DQ0 to DQ71 plotted against resync phase (0 to 360 degrees) with the valid data window marked.]
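One common way to implement the capture calibration described on this slide is to sweep a delay setting, record pass or fail at each step, and park the setting in the middle of the widest passing run. The Python sketch below shows that idea only; it is not the vendor's algorithm, and the pass/fail list stands in for real read-back comparisons.

```python
def passing_window(results):
    """Return (first, last) indices of the longest run of passing settings, or None."""
    best = None
    start = None
    # A trailing False closes any run that reaches the end of the sweep.
    for index, passed in enumerate(list(results) + [False]):
        if passed and start is None:
            start = index
        elif not passed and start is not None:
            if best is None or index - start > best[1] - best[0] + 1:
                best = (start, index - 1)
            start = None
    return best


def window_center(results):
    """Delay setting at the center of the longest passing window."""
    window = passing_window(results)
    if window is None:
        raise ValueError("no passing setting found in sweep")
    return (window[0] + window[1]) // 2


if __name__ == "__main__":
    sweep = [False, False, True, True, True, True, True, False]
    print("window:", passing_window(sweep))
    print("center:", window_center(sweep))
```

Running the same search per DQ bit and then shifting each bit's delay so the centers line up is one way to picture the per-bit de-skew shown in the figure.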
High-Level Output Topology

[Figure: output-path block diagram showing CLK, the X and X+90 phase taps (DQS ptap control), the DQS OUT1 and DQS OUT2 delays (DQS out dtap1 and dtap2 control), and the DQ OUT1 and DQ OUT2 delays (DQ out dtap1 and dtap2 control).]

Calibration knobs
- DQ-out1 and DQ-out2 delay: control the delay applied to outgoing DQ pins
- DQS-out1 and DQS-out2 delay: control the delay applied to outgoing DQS pins
- Write leveling output: changes the delay on both DQ and DQS relative to the memory clock, in phase taps
High-Level Input Topology

[Figure: input-path block diagram showing the VFIFO (vfifo control), the DQS enable path with X phase and DQS En delay (dqs_en ptap and dtap control), the DQS IN delay and DQS delay chain (DQS in dtap control), the DQ IN delay (DQ in dtap control), the DDIO input, and the LFIFO (lfifo control).]

Calibration knobs
- DQ-in delay: controls the delay applied to incoming DQ pins
- DQS-in delay: controls the delay applied to incoming DQS pins
- LFIFO: controls the number of cycles after the read command that data is read out of the LFIFO
- DQS-En phase: controls the delay on DQS En in phase taps
- DQS-En delay: controls the delay on DQS En in dtaps
- VFIFO: adjusts the delay, in cycles, applied to the controller-provided DQS burst signal to generate DQS enable
Calibration Stages
- DQS-enable calibration
  - Calibrate DQS enable (delayed read data valid) relative to DQS
- Post-amble tracking
  - Track DQS enable across temperature variation
- Read data deskew
  - Calibrate DQS relative to read command (read leveling)
  - Calibrate DQ versus DQS (per-bit deskew) for reads
- LFIFO training
  - Calibrate LFIFO delay cycles (read latency)
- Write leveling
  - Calibrate DQS and DM to write command (write leveling)
- Write data deskew
  - Calibrate DQ versus DQS (per-bit deskew) for writes
- Address/command training (leveling and deskew)
  - Calibrate CS, CAS, RAS, and ODT versus memory clock
- VREF training (FPGA and memory)
  - Calibrates receiver voltage threshold (for DDR4 with pseudo-open-drain DQs)

[Flowchart: Start, then wait for PLL/DLL locking, then the calibration loop: initialize INST/AC ROM for all pins on this memory interface, initialize the memory (mode registers, etc.), calibrate the memory interface, and repeat until all memory interfaces are calibrated. Then the user mode loop: if a user command is found in DPRIO, process the DPRIO user command; if a user command is found in RAM, process the RAM user command.]
Good Layout Practices for DDR4
DDR4 Output Driver

[Figure: DDR3 push-pull output driver compared with the DDR4 pseudo-open-drain output driver.]

Content courtesy of Micron
Unadjusted, Non-Terminated Data Eye

[Figure: data eye showing VDD overshoot, VSS undershoot, and jitter.]
Terminated Data Eye

[Figure: data eye annotated with overshoot, VIHac, VIHdc, high ringback, VREF, low ringback, VILdc, VILac, and undershoot.]
OCT from the Controller Standpoint
- DQ and CA pins are terminated differently in DDR4

| Group | Specification | DDR3 | DDR4 |
|---|---|---|---|
| Interface | Density / speed | 512 Mb to 8 Gb; 1.6 to 2.1 Gbps | 2 Gb to 16 Gb; 1.6 to 3.2 Gbps |
| Interface | Voltage (VDD / VDDQ / VPP) | 1.5 V / 1.5 V / NA (1.35 V / 1.35 V / NA) | 1.2 V / 1.2 V / 2.5 V |
| Interface | VREF | External VREF (VDD / 2) | Internal VREF (needs training) |
| Interface | Data I/Os | CTT (34 ohm) | POD (34 ohm) |
| Interface | CMD/ADDR I/Os | CTT | CTT |
| Interface | Strobe | Bi-directional / differential | Bi-directional / differential |
| Core architecture | Number of banks | 8 | 16 (4 bank groups) |
| Core architecture | Page size (x4 / x8 / x16) | 1 KB / 1 KB / 2 KB | 512 B / 1 KB / 2 KB |
| Core architecture | Number of prefetch | 8 bits | 8 bits |
| Core architecture | Added function | RESET / ZQ / dynamic ODT | + CRC / DBI / multi preamble |
| Physical | Package type / balls (x4, x8 / x16) | 78 / 96 BGA | 78 / 96 BGA |
| Physical | DIMM type | R, LR, U, SoDIMM | + ECC SoDIMM |
| Physical | DIMM pins | 240 (R, LR, U) / 204 (So) | 284 (R, LR, U) / 256 (So) |
OCT Calibration Scheme to Support DDR4
- OCT can calibrate two times with two sets of pins (DQ/CA)
- DQ and CA pins will have two different sets of codes in DDR4

[Figure: OCT calibration for DDR4 compared with DDR3.]
General Layout Concerns
- Avoid crossing splits in the power plane
- SSO on the controller can collapse strobes/clocks
  - Separate supplies and/or flip-chip packaging helps
  - Low-pass VREF filtering on the controller helps
- Minimize VREF noise
- Minimize intersymbol interference (ISI)
- Minimize crosstalk
Layout and Termination (1/12)
- Signal integrity review
- Importance of transmission line theory
  - Today's clock rates are too fast to ignore it
  - A matched-impedance line is important for good signaling
  - Mismatched-impedance lines result in reflections
  - Termination schemes are used to reduce or eliminate reflections
- Good power bussing is paramount to reducing SSO
  - SSO reduces voltage and timing margins
  - Decoupling capacitor needs and requirements
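The statement that mismatched impedances cause reflections can be quantified with the standard transmission-line reflection coefficient, Γ = (Z_load - Z0) / (Z_load + Z0). The slide does not give this formula, so the Python sketch below is an added illustration, and the impedance values in the example are arbitrary.

```python
def reflection_coefficient(z_load_ohm, z0_ohm):
    """Fraction of the incident voltage wave reflected at a load.

    Standard transmission-line relation: (Z_load - Z0) / (Z_load + Z0).
    """
    return (z_load_ohm - z0_ohm) / (z_load_ohm + z0_ohm)


if __name__ == "__main__":
    z0 = 50.0  # example line impedance in ohms
    for z_load in (50.0, 40.0, 100.0):
        gamma = reflection_coefficient(z_load, z0)
        print(f"Z_load={z_load:.0f} ohm on a {z0:.0f} ohm line -> gamma={gamma:+.3f}")
```

A termination equal to the line impedance gives a coefficient of zero, which is the reason termination values are chosen to match Zo.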
Layout and Termination (2/12)
- Signal integrity analysis is paramount to developing cost-effective high-speed memory systems
- Develop a timing budget for proof of concept
- Use models to simulate
  - Board skews are important and should be accounted for
  - For ISI, crosstalk, VREF noise, path length matching, Cin, and RTT mismatch, employ industry practices and assumptions
  - Model vias too
- Eliminate return path discontinuities (RPDs)
- Minimize SSO effects
  - Difficult to model
Layout and Termination (3/12)
- DRAM and controller package parasitics are fixed
  - SSO effects are already contained in their specified timings
  - However, these apply to test conditions with specific decoupling
- The power delivery network (PDN) for the controller and DRAM needs to be properly designed
  - Lowering power supply inductance minimizes signaling variations between devices
  - Use power and ground planes wherever possible
  - Make all power and ground traces as wide as possible
  - Couple power and ground as much as possible
    - Lowers inductance (mutual effects)
Layout and Termination (4/12)
- SSO
  - Timing and noise issues generated by rapid changes in voltage and current, caused by multiple circuits switching simultaneously in the same direction
- Problems caused by SSO
  - False triggers due to power/ground bounce
  - Reduced timing margin due to SSO-induced skew
  - Reduced voltage margin due to power/ground noise
  - Slew rate variation
Layout and Termination (5/12)
- Good power bussing is paramount to reducing SSO: $V = L \, \frac{di}{dt}$
- Reduce L (power delivery effective inductance)
  - Use planes for power and ground distribution
  - Proper routing of power and ground traces to devices
  - Proper use of decoupling capacitance
    - Locate as close as possible to the component pins
- Reduce di/dt (switching current slew rate)
  - Use the slowest drive edge that will work
  - Use reduced drive strength instead of full drive where possible
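The V = L di/dt relation on this slide can be turned into a quick estimate of supply bounce when several outputs switch together. In the Python sketch below, the effective inductance, per-driver current step, edge time, and driver count are all hypothetical example values, and the function name is an assumption for illustration.

```python
def supply_bounce_v(inductance_nh, di_ma_per_driver, dt_ns, drivers=1):
    """Estimate V = L * di/dt for drivers switching in the same direction.

    inductance_nh: effective power delivery inductance in nH
    di_ma_per_driver: current step per driver in mA
    dt_ns: edge (switching) time in ns
    """
    total_di_a = drivers * di_ma_per_driver * 1e-3
    return (inductance_nh * 1e-9) * total_di_a / (dt_ns * 1e-9)


if __name__ == "__main__":
    fast = supply_bounce_v(0.1, 20.0, 0.1, drivers=8)
    slow = supply_bounce_v(0.1, 20.0, 0.2, drivers=8)
    print(f"0.1 ns edge: {fast * 1000:.0f} mV of bounce")
    print(f"0.2 ns edge: {slow * 1000:.0f} mV of bounce")
```

Halving either the inductance or the edge rate halves the estimated bounce, which matches the slide's two levers: reduce L and reduce di/dt.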
Layout and Termination (6/12)
- RPDs induce board noise and are difficult to model
  - Splits/holes in reference planes
  - Connector discontinuities
  - Layer changes
- Avoid RPDs if at all possible
  - Avoid crossing holes/splits in the reference plane
  - Route signals so they reference the proper domain
  - Add power/ground vias to the board
    - Especially in dense layer-change areas
  - Place decoupling capacitors near connectors

[Figure: split return path compared with solid return path.]
Layout and Termination (7/12)
- VREF noise
  - Induces strobe-to-data skews and reduces voltage margins
  - Power/ground plane noise
  - Crosstalk
- Minimize VREF noise
  - Use the widest trace practical to route
  - From chip to decoupling capacitor
  - Use large spacing between VREF and neighboring traces
Layout and Termination (8/12)
- ISI
  - Occurs when data is random
  - Clocks do not have ISI
  - Multiple bits are on the bus at the same time
    - The bus cannot settle from bit #1 before bit #2, etc.
  - Signal edges jitter due to the previous bit's energy still on the bus
  - Ringing due to impedance mismatches
  - Low-pass structures can cause ISI
- Minimize ISI
  - Optimize layout
  - Keep board/DIMM impedances matched
    - Drive impedance should be the same as Zo of the transmission line
  - Terminate nets
    - Termination values should be the same as Zo of the transmission line
  - Select a high-quality connector
    - Matched to board/DIMM impedance
    - Low mutual coupling
Layout and Termination (9/12)
- Crosstalk
  - Coupling on board, package, and connector from other signals, including RPDs
  - Inductive coupling is typically stronger than capacitive coupling
- When aggressors fire at the same time as the victim (e.g., data-to-data coupling)
  - Victim edge speeds up or slows down, causing jitter
- When aggressors do not fire at the same time as the victim (e.g., data-to-command/address coupling)
  - Noise couples onto the victim at the time of aggressor switching
Layout and Termination (10/12)
- Minimize crosstalk
  - Keep bits that switch on the same clock edge routed together
    - Route data bits next to other data bits; never next to CMD/ADDR bits
  - Isolate sensitive bits (strobes)
    - If need be, route next to signals that rarely switch
  - Separate traces by at least two to three (preferred) conductor widths (more accurately, one would define the separation by trace pitch and height above the reference plane)
    - Example: a 5-mil trace located 5 mils from a reference plane should have a 15-mil gap to its nearest neighbors to minimize crosstalk
  - Choose a high-quality connector
  - Run traces as stripline (as opposed to microstrip)
    - Not at the cost of additional vias
  - Maintain good references for signals and their return paths
    - Avoid RPDs
  - Keep driver, board Zo, and ODT selections well matched
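The spacing guideline on this slide (a gap of two to three conductor widths, three preferred) lends itself to a simple rule check. The Python sketch below applies only that width-multiple rule and ignores trace pitch and height above the reference plane, which the slide notes would be the more accurate basis; the trace names and dimensions are made up.

```python
def min_gap_mils(trace_width_mils, width_multiple=3):
    """Minimum edge-to-edge gap under the 'N conductor widths' rule of thumb."""
    return trace_width_mils * width_multiple


def spacing_violations(traces, width_multiple=3):
    """Names of traces whose gap to the nearest neighbor is below the rule.

    traces: iterable of (name, width_mils, gap_to_nearest_neighbor_mils).
    """
    return [
        name
        for name, width, gap in traces
        if gap < min_gap_mils(width, width_multiple)
    ]


if __name__ == "__main__":
    print("gap for a 5-mil trace:", min_gap_mils(5), "mils")
    routed = [("DQ0", 5, 15), ("DQ1", 5, 10), ("DQS0", 5, 20)]
    print("violations:", spacing_violations(routed))
```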
Layout and Termination (11/12)
- Cin mismatch
  - Differing input capacitances on receiver pins
  - Adds skew to input timings
- RTT mismatch
  - Termination resistors not at nominal value
  - Internal ODT on data pins has smaller variation than on DDR2
    - It is calibrated (so is the DRAM's Ron)
  - External termination resistor variation must be accounted for
    - Consider one-percent resistors
Layout and Termination (12/12)
- High-speed signals must maintain a solid reference plane
  - The reference plane may be either VDD or ground
  - For DDR3 UDIMM systems, the DQ busses are referenced to ground while the ADDR/CMD and clock are referenced to VDD
  - All signals may be referenced to ground if the layout allows
- Best signaling is obtained when a constant reference plane is maintained
  - If this is not possible, try to make the transitions near decoupling capacitors

[Figure: board cross-section showing a signal, a power plane, a ground plane, and a decoupling capacitor.]
Board Debug Tools to Minimize Issues
TimeQuest DDR Timing: Read Capture
- Before calibration is the standard timing analysis
- Calibrating out some of the process variations in the memory
- Calibrating to the FPGA variation (deskew + pessimism removal)
- Errors in the calibration algorithm
- Effects of temperature and voltage changes on the calibration
- Total margin after calibration
EMIF Debug Toolkit Features
- Reports results of the last calibration to the user
  - Reports interface details, margins observed before calibration, settings made during calibration, and post-calibration margins
  - In the case of a calibration failure, the toolkit reports the stage at which calibration failed and the group
- Provides eye monitor support
- Provides loopback support
- Allows user interaction with the memory interface
  - Send commands to the memory interface to recalibrate and to mask groups and ranks
  - Eye monitor support of the data valid window
  - Loopback support for bit error rate (BER) testing
TimeQuest-Like GUI
- Reports section
- Tasks section
- Commands run are shown in the console
On-Chip EMIF Debug Toolkit
- Core access to calibration data
  - Access the same calibration data as the EMIF toolkit, now via FPGA logic
  - Via Avalon Memory-Mapped (Avalon-MM) interface
Looking Ahead and Conclusion
Will There Be a DDR5?
- Very unlikely
  - SI for a parallel bus of 2 GHz and above would be very difficult
  - Timing budget would be consumed in the package
    - PDN noise
    - Package skew
- Transition to stacked memory
  - Hybrid Memory Cube and serialized memory
  - 3D memories integrated into ASICs
Conclusion
- DDR4 has many ways to reduce overall system power
  - ~50% lower power than DDR3 at 1.5 V
- DDR4 is 33% faster than DDR3-2133
- But there are challenges
  - Shrinking data valid window
  - Increased signal integrity and power integrity concerns
- These can be overcome by good controller design
  - Innovative calibration
  - Good ODT
- Careful board design
- Good board debug tools
Thank You