Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM


1 Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. Kevin Chang, Prashant Nair, Donghyuk Lee, Saugata Ghose, Moinuddin Qureshi, and Onur Mutlu

2 Problem: Inefficient Bulk Data Movement. Bulk data movement is a key operation in many applications. memmove & memcpy: 5% of cycles in Google's datacenter [Kanev+ ISCA 15]. [Figure: CPU (four cores, LLC, memory controller) connected to memory (src, dst) over a 64-bit channel.] Long latency and high energy.

3 Moving Data Inside DRAM? [Figure: a DRAM chip with four banks; a bank holds Subarray 1 to Subarray N, each 512 rows x 8Kb of DRAM cells, sharing a 64b internal data bus.] Low connectivity in DRAM is the fundamental bottleneck for bulk data movement. Goal: Provide a new substrate to enable wide connectivity between subarrays.

4 Key Idea and Applications. Low-cost Inter-linked Subarrays (LISA): fast bulk data movement between subarrays. Wide datapath via isolation transistors: 0.8% DRAM chip area. [Figure: Subarray 1 linked to Subarray 2.] LISA is a versatile substrate → new applications. Fast bulk data copy: copy latency 1.363ms → 0.148ms (9.2x); 66% speedup, -55% DRAM energy. In-DRAM caching: hot data access latency 48.7ns → 21.5ns (2.2x); 5% speedup. Fast precharge: precharge latency 13.1ns → 5.0ns (2.6x); 8% speedup.

5 Outline: Motivation and Key Idea; DRAM Background; LISA Substrate; New DRAM Command to Use LISA; Applications of LISA

6 DRAM Internals. [Figure: a bank of 16~64 subarrays (SAs) connected to the I/O by a 64b internal data bus; each subarray is 512 rows x 8Kb with a row decoder, wordlines, bitlines, and a row buffer made of sense amplifiers and precharge units.] 8~16 banks per chip.

7 DRAM Operation. 1. ACTIVATE: Store the row into the row buffer. 2. READ: Select the target column and drive it to the I/O. 3. PRECHARGE: Reset the bitlines for a new ACTIVATE. [Figure: bitline voltage levels Vdd/2 and Vdd; data path to bank I/O.]

8 Outline: Motivation and Key Idea; DRAM Background; LISA Substrate; New DRAM Command to Use LISA; Applications of LISA

9 Observations. 1. Bitlines serve as a bus that is as wide as a row. 2. Bitlines between subarrays are close but disconnected. [Figure: subarrays sharing the 64b internal data bus.]

10 Low-Cost Interlinked Subarrays (LISA). Interconnect bitlines of adjacent subarrays in a bank using isolation transistors (links). LISA forms a wide datapath between subarrays. [Figure: with the links ON, an 8Kb-wide path between subarrays alongside the 64b internal data bus.]

11 New DRAM Command to Use LISA. Row Buffer Movement (RBM): Move a row of data in an activated row buffer to a precharged one. [Figure: RBM: SA1 → SA2. With the links on, charge sharing takes the activated bitlines of Subarray 1 from Vdd to Vdd-Δ and the precharged bitlines of Subarray 2 to Vdd/2+Δ; the charge is then amplified to Vdd.] RBM transfers an entire row between subarrays.

12 RBM Analysis. The range of RBM depends on the DRAM design: multiple RBMs to move data across > 3 subarrays. [Figure: Subarray 1, Subarray 2, Subarray 3.] Validated with SPICE using worst-case cells (NCSU FreePDK 45nm library). 4KB data in 8ns (w/ 60% guardband): 500 GB/s, 26x bandwidth of a DDR channel. 0.8% DRAM chip area overhead [O+ ISCA 14].
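The bandwidth figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only. The slide does not say which DDR channel it compares against, so the 64-bit DDR4-2400 channel used here is an assumption, as are all the constant and function names.

```python
# Back-of-the-envelope check of the RBM bandwidth numbers.
ROW_BYTES = 4 * 1024       # 4KB moved by one RBM (from the slide)
RBM_LATENCY_S = 8e-9       # 8ns, including the 60% guardband (from the slide)

# ASSUMPTION: the "DDR channel" is a 64-bit DDR4-2400 channel.
DDR_TRANSFERS_PER_S = 2400e6
DDR_BUS_BYTES = 8


def rbm_bandwidth(row_bytes=ROW_BYTES, latency_s=RBM_LATENCY_S):
    """Bytes per second moved by one row-wide RBM."""
    return row_bytes / latency_s


def ddr_channel_bandwidth(transfers_per_s=DDR_TRANSFERS_PER_S,
                          bus_bytes=DDR_BUS_BYTES):
    """Peak bytes per second of the assumed DDR channel."""
    return transfers_per_s * bus_bytes


rbm_bw = rbm_bandwidth()
ratio = rbm_bw / ddr_channel_bandwidth()
print(f"RBM: {rbm_bw / 1e9:.0f} GB/s, {ratio:.1f}x the assumed DDR channel")
```

With 4KB taken as 4096 bytes, the sketch gives roughly 512 GB/s, which the slide rounds to 500 GB/s. Under the DDR4-2400 assumption the ratio lands between 26x and 27x, in line with the slide's 26x.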

13 Outline: Motivation and Key Idea; DRAM Background; LISA Substrate; New DRAM Command to Use LISA; Applications of LISA: 1. Rapid Inter-Subarray Copying (RISC), 2. Variable Latency DRAM (VILLA), 3. Linked Precharge (LIP)

14 1. Rapid Inter-Subarray Copying (RISC). Goal: Efficiently copy a row across subarrays. Key idea: Use RBM to form a new command sequence. 1. Activate src row (Subarray 1). 2. RBM SA1 → SA2. 3. Activate dst row (write row buffer into dst row, Subarray 2). Reduces row-copy latency by 9.2x, DRAM energy by 48.1x.
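The three-step command sequence above can be sketched as a toy functional model. This is not the authors' implementation or a timing-accurate simulator. The `Subarray` class, `rbm`, and `risc_copy` are hypothetical names introduced for illustration, and the model ignores voltages, timing, and multi-hop movement.

```python
class Subarray:
    """Toy subarray: a list of rows plus one row buffer (None = precharged)."""

    def __init__(self, num_rows=512, row_bytes=8):
        self.rows = [bytes(row_bytes) for _ in range(num_rows)]
        self.row_buffer = None

    def activate(self, row):
        if self.row_buffer is None:
            # Normal ACTIVATE: latch the row into the row buffer.
            self.row_buffer = self.rows[row]
        else:
            # Row buffer already holds data (e.g. after an RBM):
            # activating a row writes the buffer contents into it.
            self.rows[row] = self.row_buffer

    def precharge(self):
        self.row_buffer = None


def rbm(src, dst):
    """Row Buffer Movement: activated row buffer -> precharged row buffer."""
    if src.row_buffer is None:
        raise RuntimeError("source row buffer must be activated")
    if dst.row_buffer is not None:
        raise RuntimeError("destination row buffer must be precharged")
    dst.row_buffer = src.row_buffer


def risc_copy(src_sa, src_row, dst_sa, dst_row):
    """Copy one row between two adjacent subarrays using the slide's steps."""
    if src_sa.row_buffer is not None or dst_sa.row_buffer is not None:
        raise RuntimeError("both subarrays must start precharged")
    src_sa.activate(src_row)   # step 1: activate src row
    rbm(src_sa, dst_sa)        # step 2: RBM SA1 -> SA2
    dst_sa.activate(dst_row)   # step 3: activate dst row (writes the buffer)
    src_sa.precharge()         # return both subarrays to the precharged state
    dst_sa.precharge()
```

For example, after `sa1.rows[1] = b"LISA"` and `risc_copy(sa1, 1, sa2, 3)`, row 3 of `sa2` holds the same bytes. The data never crosses the narrow internal data bus in this model, which is the point of the command sequence.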

15 Methodology. Cycle-level simulator: Ramulator [CAL 15]. CPU: 4 out-of-order cores, 4GHz. L1: 64KB/core, L2: 512KB/core, L3: shared 4MB. DRAM: DDR3-1600, 2 channels. Benchmarks: memory-intensive (TPC, STREAM, SPEC2006, DynoGraph, random) and copy-intensive (Bootup, forkbench, shell script). 50 workloads: memory- + copy-intensive. Performance metric: Weighted Speedup (WS).
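The slide names Weighted Speedup (WS) as the performance metric without defining it. A minimal sketch follows, assuming the commonly used definition: the sum over cores of each application's IPC when sharing the system divided by its IPC when running alone. The function names and the percentage helper are illustrative, not taken from the slides.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application IPC ratios (shared run / alone run)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per shared-IPC")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def ws_improvement_percent(ws_baseline, ws_new):
    """Relative WS change of a new design over the baseline, in percent."""
    return (ws_new / ws_baseline - 1.0) * 100.0
```

Under this definition, a "66% speedup" for a workload would mean that the WS of the evaluated design is 1.66 times the WS of the baseline on that workload.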

16 Comparison Points. Baseline: Copy data through CPU (existing systems). RowClone [Seshadri+ MICRO 13]: in-DRAM bulk copy scheme; fast intra-subarray copying via bitlines; slow inter-subarray copying via internal data bus.

17 System Evaluation: RISC. [Figure: WS improvement and DRAM energy reduction over baseline (%), RowClone vs. RISC. RISC: 66% WS improvement, 55% DRAM energy reduction. RowClone: -24% WS improvement (degrades bank-level parallelism).] Rapid Inter-Subarray Copying (RISC) using LISA improves system performance.

18 2. Variable Latency DRAM (VILLA). Goal: Reduce DRAM latency with low area overhead. Motivation: Trade-off between area and latency. Long bitline (DDRx) vs. short bitline (RLDRAM). Shorter bitlines → faster activate and precharge time, but high area overhead: >40%.

19 2. Variable Latency DRAM (VILLA). Key idea: Reduce access latency of hot data via a heterogeneous DRAM design [Lee+ HPCA 13, Son+ ISCA 13]. VILLA: Add fast subarrays as a cache in each bank. [Figure: a fast subarray (32 rows) between slow subarrays (512 rows).] Challenge: VILLA cache requires frequent movement of data rows. LISA: Cache rows rapidly from slow to fast subarrays. Reduces hot data access latency by 2.2x at only 1.6% area overhead.

20 System Evaluation: VILLA. 50 quad-core workloads: memory-intensive benchmarks. [Figure: normalized speedup of VILLA and VILLA cache hit rate (%) across the 50 workloads.] Max: 16%, Avg: 5%. Caching hot data in DRAM using LISA improves system performance.

21 3. Linked Precharge (LIP). Problem: The precharge time is limited by the strength of one precharge unit. Linked Precharge (LIP): LISA precharges a subarray using multiple precharge units. [Figure: precharging an activated row in conventional DRAM vs. linked precharging in LISA DRAM with the links on.] Reduces precharge latency by 2.6x (43% guardband).

22 System Evaluation: LIP. 50 quad-core workloads: memory-intensive benchmarks. [Figure: normalized speedup of LIP across the 50 workloads.] Max: 13%, Avg: 8%. Accelerating precharge using LISA improves system performance.

23 Other Results in Paper. Combined applications. Single-core results. Sensitivity results: LLC size, number of channels, copy distance. Qualitative comparison to other heterogeneous DRAM. Detailed quantitative comparison to RowClone.

24 Summary. Bulk data movement is inefficient in today's systems. Low connectivity between subarrays is a bottleneck. Low-cost Inter-linked Subarrays (LISA): bridge bitlines of subarrays via isolation transistors; wide datapath with 0.8% DRAM chip area. LISA is a versatile substrate → new applications. Fast bulk data copy: 66% speedup, -55% DRAM energy. In-DRAM caching: 5% speedup. Fast precharge: 8% speedup. LISA can enable other applications. Source code will be available in April.


26 Backup

27 SPICE on RBM

28 Comparison to Prior Works. Heterogeneous DRAM designs: TL-DRAM (Lee+ HPCA 13), CHARM (Son+ ISCA 13), VILLA. Level of heterogeneity: intra-subarray (TL-DRAM), inter-bank (CHARM), inter-subarray (VILLA). Caching latency: X. Cache utilization: X.

29 Comparison to Prior Works. DAS-DRAM (Lu+ MICRO 15): goal, heterogeneous DRAM design; enable other applications?, X; movement mechanism, migration cells; scalable copy latency, X. LISA: goal, substrate for bulk data movement; movement mechanism, low-cost links.

30 LISA vs. Samsung Patent. S.-Y. Seo, "Methods of Copying a Page in a Memory Device and Methods of Managing Pages in a Memory System," U.S. Patent Application, 2014. Only for copying data. Vague: lack of detail on implementation. How does data get moved? What are the steps? No analysis on the latency and energy. No system evaluation.

31 RBM Across 3 Subarrays. [Figure: Subarray 1, Subarray 2, Subarray 3; RBM: SA1 → SA3.]

32 Comparison of Inter-Subarray Row Copying. [Figure: DRAM energy (µJ) vs. latency (ns) for RISC, RowClone [MICRO'13], and memcpy (baseline), with points labeled by number of hops.] 9x latency and 48x energy reduction.

33 RISC: Cache Coherence. Data in DRAM may not be up-to-date. The memory controller (MC) flushes dirty data (src) and invalidates dst. Techniques to accelerate cache coherence: Dirty-Block Index [Seshadri+ ISCA 14]. Other papers handle a similar issue [Seshadri+ MICRO 13, CAL 15].

34 RISC vs. RowClone. 4-core results. [Figure: comparison against RowClone.]

35 Sensitivity to Cache Size. Single core: RISC vs. baseline as LLC size changes. Baseline: higher cache pollution as LLC size decreases. Forkbench: RISC hit rate 67% (4MB) to 10% (256KB); Base hit rate 20% to 19%.

36 Combined Applications. [Figure labels: +8%, +16%, 59%.]

37 Sensitivity to Copy Distance

38 VILLA Caching Policy. Benefit-based caching policy [HPCA 13]: a benefit counter tracks the number of accesses per cached row. Any caching policy can be applied to VILLA. Configuration: 32 rows inside a fast subarray; 4 fast subarrays per bank; 1.6% area overhead.
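A benefit counter per cached row can be sketched as below. This is a simplified illustration, not the policy from the cited HPCA 13 work. The insertion rule (cache every missed row) and the eviction rule (drop the row with the smallest counter) are assumptions, and `BenefitCache` is a hypothetical name. The default capacity of 128 rows follows from the slide's configuration of 32 rows per fast subarray and 4 fast subarrays per bank.

```python
class BenefitCache:
    """Toy per-bank row cache that tracks a benefit counter per cached row."""

    def __init__(self, capacity=32 * 4):
        self.capacity = capacity
        self.benefit = {}  # cached row id -> number of accesses while cached

    def access(self, row):
        """Record an access to `row`; return True on a cache hit."""
        if row in self.benefit:
            self.benefit[row] += 1
            return True
        if len(self.benefit) >= self.capacity:
            # ASSUMPTION: evict the cached row with the lowest benefit.
            victim = min(self.benefit, key=self.benefit.get)
            del self.benefit[victim]
        # ASSUMPTION: every missed row is inserted into the cache.
        self.benefit[row] = 1
        return False
```

In a LISA-based design, each insertion would correspond to moving a row from a slow subarray into a fast one, so a policy that inserts less eagerly than this sketch may be preferable when row movement is not free.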

39 Area Measurement. Row-Buffer Decoupling by O et al., ISCA 14. 28nm DRAM process, 3 metal layers. 8Gb and 8 banks per device.

40 Other slides


42 3. Linked Precharge (LIP). Problem: The precharge time is limited by the strength of one precharge unit (PU). Linked Precharge (LIP): LISA's connectivity enables DRAM to utilize additional PUs from other subarrays. [Figure: precharging an activated row in conventional DRAM vs. linked precharging in LISA DRAM with the links on.]

43 Key Idea and Applications. Low-cost Inter-linked Subarrays (LISA): fast bulk data movement between subarrays. Wide datapath via isolation transistors: 0.8% DRAM chip area. [Figure: Subarray 1 linked to Subarray 2.] LISA is a versatile substrate → new applications. 1. Fast bulk data copy: copy latency 1.3ms → 0.1ms (9x); 66% sys. performance and 55% energy efficiency. 2. In-DRAM caching: access latency 48ns → 21ns (2x); 5% sys. performance. 3. Linked precharge: precharge latency 13ns → 5ns (2x); 8% sys. performance.

44 Low-Cost Inter-Linked Subarrays (LISA). [Figure: row decoder and LISA link.]


46 Key Idea and Applications. Low-cost Inter-linked Subarrays (LISA): fast bulk data movement between subarrays. Wide datapath via isolation transistors: 0.8% DRAM chip area. [Figure: Subarray 1 linked to Subarray 2.] LISA is a versatile substrate → new applications. Fast bulk data copy: copy latency 1.363ms → 0.148ms (9.2x); +66% speedup and -55% DRAM energy. In-DRAM caching: hot data access latency 48.7ns → 21.5ns (2.2x); 5% sys. performance. Fast precharge: precharge latency 13.1ns → 5ns (2.6x); 8% sys. performance.

47 Moving Data Inside DRAM? [Figure: a DRAM chip with four banks; a bank holds Subarray 1 to Subarray N, each 512 rows x 8Kb of DRAM cells, sharing a 64b internal data bus.] Low connectivity in DRAM is the fundamental bottleneck for bulk data movement. Goal: Provide a new substrate to enable wide connectivity between subarrays.

48 Low Connectivity in DRAM. Problem: Simply moving data inside DRAM is inefficient. [Figure: a DRAM chip with four banks; a bank holds Subarray 1 to Subarray N, each 512 rows x 8Kb of DRAM cells, sharing a 64b internal data bus.] Low connectivity in DRAM is the fundamental bottleneck for bulk data movement. Goal: Provide a new substrate to enable wide connectivity between subarrays.

49 Low Connectivity in DRAM. Problem: Simply moving data inside DRAM is inefficient. [Figure: a DRAM chip with four banks of 512 rows x 8Kb DRAM cells and a 64b internal data bus.] Low connectivity in DRAM is the fundamental bottleneck for bulk data movement.
