DRAM Bank Organization
1 DRAM Bandwidth
2 DRAM Bank Organization (figure: a row address feeds a row decoder that selects a row of the memory cell core array; sense amps capture the row into column latches, and a column address drives a mux that sends the selected data off-chip)
3 DRAM Core Arrays are Slow
4 DRAM Core Arrays are Slow. DDR: core speed = 1/2 interface speed. DDR2/GDDR3: core speed = 1/4 interface speed. DDR3/GDDR4: core speed = 1/8 interface speed. DDR4/GDDR5: the core-to-interface ratio shrinks further.
5 DRAM Bursting. DRAM core arrays are slow; DDR3/GDDR4: core speed = 1/8 interface speed.
6 DRAM Bursting. DRAM core arrays are slow (DDR3/GDDR4: core speed = 1/8 interface speed). SDRAM cores are clocked at 1/N the speed of the interface: (N × interface width) DRAM bits are loaded from the same row into an internal buffer, then transferred in N steps at interface speed. DDR3/GDDR4: buffer width = 8× interface width.
7 DRAM Bursting (figure: non-burst timing vs. burst timing along a time axis; a core array access delay is followed by moving the bits to the pins). Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred, but they are discarded when accesses are not to sequential locations.
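To make the cost of that discarding concrete, a minimal back-of-the-envelope sketch (the 128-byte burst size comes from slide 16; the 4-byte access size is an assumption for illustration):

```c
#include <stdio.h>

int main(void) {
    const int burst_bytes  = 128; /* burst size, per slide 16          */
    const int access_bytes = 4;   /* one float per access (assumption) */

    /* A non-sequential 4-byte access still drags a full 128-byte burst
       across the interface; the other 124 bytes are discarded. */
    double useful = (double)access_bytes / burst_bytes;
    printf("useful fraction of each burst: %.1f%%\n", useful * 100.0);
    return 0;
}
```

At roughly 3% useful bytes per burst, scattered accesses forfeit almost all of the interface's peak bandwidth.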
8 DRAM Bursting (figure: single-bank burst timing leaves dead time on the interface; multi-bank burst timing reduces dead time).
9 GPU Off-Chip Memory Subsystem. NVIDIA GTX280 GPU: peak global memory bandwidth = 141.7 GB/s.
10 GPU Off-Chip Memory Subsystem. NVIDIA GTX280 GPU: peak global memory bandwidth = 141.7 GB/s. Global memory is GDDR3.
11 GPU Off-Chip Memory Subsystem. NVIDIA GTX280 GPU: peak global memory bandwidth = 141.7 GB/s. Global memory is GDDR3 with a 64-bit interface per channel.
12 GPU Off-Chip Memory Subsystem. NVIDIA GTX280 GPU: peak global memory bandwidth = 141.7 GB/s. Global memory is GDDR3 (core 276 MHz); a typical 64-bit interface (8 bytes, 2 transfers per clock with DDR) yields a memory bandwidth of 17.6 GB/s.
13 GPU Off-Chip Memory Subsystem. NVIDIA GTX280 GPU: peak global memory bandwidth = 141.7 GB/s. Global memory is GDDR3 (core 276 MHz); a typical 64-bit interface (8 bytes, 2 transfers per clock with DDR) yields 17.6 GB/s. To feed 141.7 GB/s, 8 memory channels are needed.
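A quick sanity check on these numbers (a sketch: the 1.1 GHz interface clock is inferred as 4× the 276 MHz core clock, since GDDR3 core speed is 1/4 interface speed):

```c
#include <stdio.h>

int main(void) {
    double bytes_per_transfer = 8.0;  /* 64-bit interface               */
    double transfers_per_clk  = 2.0;  /* DDR: 2 transfers per clock     */
    double clock_ghz          = 1.1;  /* ~4 x 276 MHz core (inferred)   */
    int    channels           = 8;

    double per_channel = bytes_per_transfer * transfers_per_clk * clock_ghz;
    printf("per channel: %.1f GB/s\n", per_channel);            /* 17.6  */
    printf("8 channels : %.1f GB/s\n", per_channel * channels); /* 140.8 */
    return 0;
}
```

With the GTX280's exact 1107 MHz memory clock the total comes to 141.7 GB/s, matching the peak quoted above.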
14 DRAM Burst (figure: an address space divided into burst sections).
15 DRAM Burst (figure: burst sections). Each address space is partitioned into burst-size sections; whenever a location is accessed, all other locations in the same section are also delivered to the processor.
16 DRAM Burst (figure: burst sections). Example: a 16-byte address space with 4-byte bursts. In practice: at least a 4 GB address space with 128-byte bursts.
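A minimal sketch of the section arithmetic implied here, assuming the 128-byte bursts on this slide:

```c
#include <stdint.h>
#include <stdio.h>

/* All bytes within one section travel together in one burst. */
static uint64_t burst_section(uint64_t addr) {
    return addr / 128;
}

int main(void) {
    /* Addresses 0 and 124 share a section: one DRAM request serves both.
       Addresses 0 and 128 do not: two requests are needed. */
    printf("%llu %llu %llu\n",
           (unsigned long long)burst_section(0),
           (unsigned long long)burst_section(124),
           (unsigned long long)burst_section(128));
    return 0;
}
```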
17 Coalesced Accesses (figure: coalesced loads, with threads T0–T3 all falling into one burst section).
18 Coalesced Accesses (figure: coalesced loads by threads T0–T3). For a load instruction executed by the threads of a warp: if all accessed locations fall into the same burst section, only one DRAM request is made and the access is fully coalesced.
19 Un-coalesced Accesses (figure: un-coalesced vs. coalesced loads by threads T0–T3). When the accessed locations spread across burst-section boundaries, coalescing fails: multiple DRAM requests are made and the access is not fully coalesced.
20 Warp Coalescence. Accesses are to consecutive locations if the index in an array access has the form [(terms independent of threadIdx.x) + threadIdx.x].
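A hedged illustration of the rule, with two hypothetical kernels (names and launch shape are assumptions, not the deck's code):

```cuda
// Index = (terms independent of threadIdx.x) + threadIdx.x:
// consecutive threads touch consecutive addresses -> one burst per warp.
__global__ void coalescedCopy(float *out, const float *in, int base) {
    int i = base + blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Index multiplies threadIdx.x by a stride: consecutive threads land in
// different burst sections -> up to one DRAM request per thread.
// (Caller must size the arrays for the largest index reached.)
__global__ void stridedCopy(float *out, const float *in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```

Assuming the 128-byte sections above, a warp of coalescedCopy on an aligned float array issues one request, while stridedCopy with a stride of 32 floats (128 bytes) issues 32.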
21 2D C Array in Linear Memory Space (figure: a 2D array laid out row-major, linearized in order of increasing address).
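The linearization rule this slide depicts, as a one-line helper (a sketch; the name is assumed):

```cuda
// Row-major layout: element (row, col) of a width-column matrix
// lives at offset row * width + col in linear memory.
__host__ __device__ inline int idx(int row, int col, int width) {
    return row * width + col;
}
```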
22 Access Patterns of Basic MM (figure: an m×n matrix A accessed as A[Row*n+i] and an n×k matrix B accessed as B[i*k+Col], producing element (Row, Col) of C).
23 Access Patterns of Basic MM (figure: the same A[Row*n+i] and B[i*k+Col] access patterns, annotated with the traversal directions).
24 B accesses are coalesced (figure: in load iterations 0 and 1, threads T0–T3 read consecutive B elements; the access direction across neighbouring threads runs along a row of B, while the access direction within a thread steps down a column).
25 A accesses are not coalesced (figure: in load iterations 0 and 1, threads T0–T3 read A elements a full row apart; the access direction within a thread runs along a row, so neighbouring threads land in different burst sections).
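Putting slides 22–25 together, a minimal sketch of the basic kernel being analyzed (variable names and launch configuration are assumptions in the style of the deck):

```cuda
// A is m x n, B is n x k, C is m x k, all row-major.
__global__ void basicMM(const float *A, const float *B, float *C,
                        int m, int n, int k) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < m && Col < k) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            // B[i*k + Col]: Col carries threadIdx.x, so the threads of a
            // warp read consecutive addresses -- coalesced.
            // A[Row*n + i]: contains no threadIdx.x term, so it cannot
            // form the consecutive pattern slide 20 describes.
            sum += A[Row * n + i] * B[i * k + Col];
        }
        C[Row * k + Col] = sum;
    }
}
```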
26 Loading an Input Tile. Have each thread load an A element and a B element at the same relative position as its C element. With int tx = threadIdx.x and int ty = threadIdx.y, the 2D indexing for accessing tile 0 is A[Row][tx] and B[ty][Col] (figure: m×n matrix A, n×k matrix B, and the m×k output C).
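A hedged sketch of the full tile-loading step this slide describes, generalized from tile 0 to phase ph (TILE_WIDTH and the phase loop are assumptions; boundary handling is omitted, so m, n, k are assumed multiples of TILE_WIDTH):

```cuda
#define TILE_WIDTH 16  /* assumed tile size */

__global__ void tiledMM(const float *A, const float *B, float *C,
                        int m, int n, int k) {
    __shared__ float Ads[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bds[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float sum = 0.0f;
    for (int ph = 0; ph < n / TILE_WIDTH; ++ph) {
        // Each thread loads one A and one B element at the same relative
        // (ty, tx) position as its C element; tx varies fastest across the
        // warp, so both global loads read consecutive addresses -- coalesced.
        Ads[ty][tx] = A[Row * n + ph * TILE_WIDTH + tx];
        Bds[ty][tx] = B[(ph * TILE_WIDTH + ty) * k + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            sum += Ads[ty][i] * Bds[i][tx];
        __syncthreads();
    }
    C[Row * k + Col] = sum;
}
```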
27 Corner Turning
28 Corner Turning
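Slides 27–28 carry only the title. As a hedged sketch of the technique they name: when a tile must be consumed in an order that would make the global-memory loads strided (for example, a column-major B), have the warp load in the coalesced order anyway and swap the indices on the shared-memory store, where reordering is cheap. The kernel below is an illustrative assumption, not the deck's code:

```cuda
#define TILE_WIDTH 16  /* assumed tile size */

/* Corner turning: B is assumed column-major here (element (r, c) lives at
   c * n + r), so a naive tile load would stride by n across a warp. Instead,
   tx runs along B's linear layout (coalesced load) and the indices are
   swapped on the shared-memory store -- the "corner turn". Dimensions are
   assumed multiples of TILE_WIDTH. */
__global__ void cornerTurnedLoad(const float *B, float *out,
                                 int n, int k, int ph) {
    /* +1 padding avoids shared-memory bank conflicts on the swapped store */
    __shared__ float Bds[TILE_WIDTH][TILE_WIDTH + 1];
    int tx = threadIdx.x, ty = threadIdx.y;

    int loadCol = blockIdx.x * TILE_WIDTH + ty;           /* note: ty here */
    Bds[tx][ty] = B[loadCol * n + ph * TILE_WIDTH + tx];  /* coalesced in tx */
    __syncthreads();

    /* Demo consume in the usual row order (a real MM kernel would
       accumulate here): copy the tile out row-major, also coalesced. */
    int r = ph * TILE_WIDTH + ty;
    int c = blockIdx.x * TILE_WIDTH + tx;
    out[r * k + c] = Bds[ty][tx];
}
```

Both the load and the store are coalesced; only the shared-memory indexing changes, which is exactly why the turn is taken there.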