VLSID KOLKATA, INDIA January 4-8, 2016


1 Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures
Ishan Thakkar, Sudeep Pasricha
Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO, U.S.A.
VLSID 2016, Kolkata, India, January 4-8, 2016

2 Outline
Introduction
Background on DRAM Structure and Refresh Operation
Related Work
Contributions
Evaluation Setup
Evaluation Results
Conclusion


4 Introduction
Main memory is DRAM (Dynamic Random Access Memory). It is a critical component of all computing systems: server, desktop, mobile, embedded, and sensor.
DRAM stores data in a cell capacitor. A fully charged cell capacitor represents logic 1, and a fully discharged cell capacitor represents logic 0.
A DRAM cell loses data over time because its cell capacitor leaks charge. For temperatures below 85 °C, a DRAM cell loses data in 64 ms. At higher temperatures, a DRAM cell loses data at a faster rate.
To preserve data integrity, the charge on each DRAM cell (cell capacitor) must be periodically restored, or refreshed.
[Figure: DRAM cell with bit line, word line, access transistor, and cell capacitor]


6 Background on DRAM Structure
Based on their structure, DRAMs are classified into two categories:
1. 2D DRAMs: planar, single-layer DRAMs
2. 3D-Stacked DRAMs: multiple 2D DRAM layers stacked on one another using through-silicon vias (TSVs)
2D DRAM structure hierarchy: Rank, Chip, Bank, Subarray, Bitcell

7 2D DRAM: Rank and Chip Structure
In a 2D DRAM rank, multiple chips work in tandem.
[Figure: DRAM rank made up of multiple DRAM chips, with a mux]

8 3D-Stacked DRAM Structure
In this paper, we consider the Hybrid Memory Cube (HMC), which is a standard for 3D-Stacked DRAMs defined by a consortium of industries.
HMC structure hierarchy: Vault, Bank, Subarray, Bitcell
[Figure: Hybrid Memory Cube]

9 DRAM Bank Structure
3D-Stacked and 2D DRAMs have similar bank structures.
[Figure labels: bank core, bank peripherals, subarray, rows, columns, row address decoder, column address decoder, sense amplifiers, row buffer, column mux, data bits]

10 DRAM Subarray Structure
3D-Stacked and 2D DRAMs have similar subarray structures.
[Figure labels: row address, word lines, bit lines, DRAM cells (access transistor and cell capacitor), sense amplifiers]

11-17 Basic DRAM Operations
PRECHARGE: all bitlines of the bank are pre-charged to 0.5 VDD.
ACTIVATION (example: subarray ID 1, row 4): the target row is opened, and then it is captured by the sense amplifiers (SAs). The SAs drive each bitline fully to either VDD or 0 V, which restores the open row. The open row is then stored in the global row buffer.
READ (example: column 1): the target data block is selected and then multiplexed out from the row buffer.
Refresh: a duet of PRECHARGE-ACTIVATION operations restores (refreshes) the target row, so dummy PRECHARGE-ACTIVATION operations are performed to refresh the rows.
[Figure labels: global address latch, subarray decoders with ID comparison (=ID?) and enable (EN), global row decoder, sense amplifiers, row buffer, column address decoder, column mux]
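The PRECHARGE, ACTIVATION, READ, and refresh steps above can be sketched as a small state model. This is only an illustrative toy, not the authors' model: the class name `ToyBank` and its methods are hypothetical, cells are plain 0/1 values, and voltages and timing are not modelled.

```python
class ToyBank:
    """Toy model of one DRAM bank (hypothetical, for illustration only)."""

    def __init__(self, rows, cols):
        # Cell contents: each row is a list of 0/1 values.
        self.cells = [[0] * cols for _ in range(rows)]
        self.open_row = None    # index of the currently open row, if any
        self.row_buffer = None  # copy of the open row held after sensing

    def precharge(self):
        # Close any open row; bitlines return to 0.5 VDD (not modelled numerically).
        self.open_row = None
        self.row_buffer = None

    def activate(self, row):
        if self.open_row is not None:
            raise RuntimeError("bank must be precharged before ACTIVATION")
        # Sense amplifiers capture the row and drive the cells back to full
        # levels; this write-back is what restores (refreshes) the row.
        self.row_buffer = list(self.cells[row])
        self.cells[row] = list(self.row_buffer)
        self.open_row = row

    def read(self, col):
        if self.open_row is None:
            raise RuntimeError("no open row to read from")
        # The target column is multiplexed out of the row buffer.
        return self.row_buffer[col]

    def refresh(self, row):
        # Dummy PRECHARGE-ACTIVATION: open the row only to restore it,
        # then close it again without reading any data out.
        self.precharge()
        self.activate(row)
        self.precharge()


bank = ToyBank(rows=8, cols=4)
bank.cells[4][1] = 1
bank.precharge()
bank.activate(4)
value = bank.read(1)
bank.refresh(4)
```

The point of the sketch is that `refresh` reuses the same activation path as a normal access, which is why refresh occupies the bank for a full row cycle.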


18 Refresh: 2D vs. 3D-Stacked DRAMs
3D-Stacked DRAMs have:
Higher capacity/density, so more rows need to be refreshed.
Higher power density, hence a higher operating temperature (>85 °C) and a smaller retention period (the time before DRAM cells lose data) of 32 ms, compared with 64 ms for 2D DRAMs.
Thus, the refresh problem is more critical for 3D-Stacked DRAMs. Therefore, in this study, we target a standardized 3D-Stacked DRAM architecture, HMC.
Dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms). To prevent long pauses, a JEDEC-standardized Distributed Refresh method is used.

19 Background: Refresh Operation
Distributed Refresh is a JEDEC-standardized method. A group of n rows is refreshed every 3.9 µs, and this group of n rows forms a Refresh Bundle (RB). The size of an RB increases with DRAM capacity, which increases trfc.
Example Distributed Refresh operation for a 1Gb HMC vault: the retention cycle is 32 ms, trefi = 3.9 µs, and one refresh bundle (RB1 through RB8192) is refreshed per trefi. The size of each RB is 16 rows (Row1 through Row16).
trefi: refresh interval. trfc: refresh cycle time, the time taken to refresh an entire RB. trc: row cycle time.
[Figure: timing diagram of one trfc window, with trc and trec marked for Row1 through Row16]
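The numbers in this example can be cross-checked with a little arithmetic. The sketch below uses only the values stated on the slide (32 ms retention cycle, 8192 refresh bundles, 16 rows per bundle); the function and variable names are made up for illustration.

```python
# Values stated on the slide for the example 1Gb HMC vault.
RETENTION_CYCLE_S = 32e-3   # retention cycle: 32 ms
NUM_BUNDLES = 8192          # RB1 ... RB8192
ROWS_PER_BUNDLE = 16        # size of each Refresh Bundle


def refresh_interval_us(retention_cycle_s: float, num_bundles: int) -> float:
    """Interval between refresh commands (trefi), in microseconds."""
    return retention_cycle_s / num_bundles * 1e6


def rows_per_retention_cycle(num_bundles: int, rows_per_bundle: int) -> int:
    """Total number of rows refreshed once per retention cycle."""
    return num_bundles * rows_per_bundle


trefi_us = refresh_interval_us(RETENTION_CYCLE_S, NUM_BUNDLES)
total_rows = rows_per_retention_cycle(NUM_BUNDLES, ROWS_PER_BUNDLE)
print(f"trefi = {trefi_us:.5f} us, rows per retention cycle = {total_rows}")
```

Dividing 32 ms by 8192 bundles gives roughly the 3.9 µs interval quoted on the slide, and 8192 bundles of 16 rows cover 131072 rows, which equals 2^17 and so appears consistent with the 17-bit address counter shown on the later BLP slide.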

20 Performance Overhead of Distributed Refresh
Source: J Liu+, ISCA 2012
The performance overhead of refresh increases as device capacity increases.


22 Energy Overhead of Distributed Refresh
Source: J Liu+, ISCA 2012
The energy overhead of refresh increases as device capacity increases.
Refresh is a growing problem that needs to be addressed to realize low-latency, low-energy DRAMs.


24 Related Work
Scattered Refresh improves upon Per-bank Refresh and All-bank Refresh.
We improve upon Scattered Refresh.

25 All-Bank Refresh vs. Per-Bank Refresh
Distributed Refresh can be implemented at two different granularities.
All-bank Refresh: all banks are refreshed simultaneously, and none of the banks is allowed to serve any request until the refresh is complete. It is supported by all general-purpose DDRx DRAMs. DRAM operation is completely stalled, so the number of available banks (#AB) is zero. It exploits bank-level parallelism (BLP) for refreshing, which gives a smaller trfc.
Per-bank Refresh: only one bank is refreshed at a time, so all other banks are allowed to serve other requests. It is supported by LPDDRx DRAMs. #AB > 0, but there is no BLP, which gives a larger trfc.
trfc: refresh cycle time.


26 All-Bank Refresh vs. Per-Bank Refresh
All-bank Refresh: smaller trfc, but the number of available banks (#AB) is 0, so DRAM operation is completely stalled while the dummy ACTIVATION-PRECHARGE operations for the refresh command are performed.
Per-bank Refresh: #AB > 0, but there is no BLP, so trfc is larger.
Both All-bank Refresh and Per-bank Refresh have drawbacks, and both can be improved.
Legend: L = layer ID, B = bank ID, SA = subarray ID, R = row ID. trfc: refresh cycle time. trc: row cycle time.

27 Scattered Refresh
Source: T Kalyan+, ISCA 2012
Scattered Refresh improves upon Per-bank Refresh by using subarray-level parallelism (SLP) for refresh. Each row of an RB is mapped to a different subarray. SLP gives the opportunity to overlap the PRECHARGE of one row with the next ACTIVATE, which reduces trfc.
Example: Scattered Refresh operation in an HMC vault with a Refresh Bundle size of 4.
Legend: L = layer ID, B = bank ID, SA = subarray ID, R = row ID.
How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?

28 Scattered Refresh
Example: refresh operation in an HMC vault with a Refresh Bundle size of 4, comparing Per-bank, Scattered, and All-bank Refresh.
trfc for All-bank Refresh < trfc for Scattered Refresh < trfc for Per-bank Refresh
Scattered Refresh leaves room for improvement.
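The ordering of trfc across the three schemes can be illustrated with a deliberately simple timing model. This is a sketch under stated assumptions, not the timing model used in the paper: the row cycle time, the amount of PRECHARGE/ACTIVATE overlap, and the bank count are hypothetical values, and the three formulas are simplifications chosen only to reproduce the qualitative ordering on the slide.

```python
import math

# Hypothetical values, chosen only for illustration (not from the slides).
T_RC_NS = 50.0      # assumed row cycle time (trc)
OVERLAP_NS = 15.0   # assumed PRECHARGE/ACTIVATE overlap enabled by SLP
NUM_BANKS = 4       # assumed number of banks refreshed together in All-bank
RB_SIZE = 4         # Refresh Bundle size used in the slide's example


def trfc_per_bank(rb_size, t_rc):
    # Rows of the bundle are refreshed back to back in a single bank.
    return rb_size * t_rc


def trfc_scattered(rb_size, t_rc, overlap):
    # Each row sits in a different subarray, so the PRECHARGE of one row
    # overlaps with the ACTIVATE of the next by `overlap`.
    return t_rc + (rb_size - 1) * (t_rc - overlap)


def trfc_all_bank(rb_size, t_rc, num_banks):
    # Rows of the bundle are spread over banks that refresh simultaneously.
    return math.ceil(rb_size / num_banks) * t_rc


per_bank = trfc_per_bank(RB_SIZE, T_RC_NS)
scattered = trfc_scattered(RB_SIZE, T_RC_NS, OVERLAP_NS)
all_bank = trfc_all_bank(RB_SIZE, T_RC_NS, NUM_BANKS)
print(f"all-bank={all_bank} ns, scattered={scattered} ns, per-bank={per_bank} ns")
```

With these assumed numbers the model yields all-bank < scattered < per-bank, matching the inequality on the slide; the absolute values carry no meaning.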


30 Contributions
Crammed Refresh: Per-bank Refresh + All-bank Refresh. Two banks are refreshed in parallel, instead of one bank as in Per-bank Refresh or all banks as in All-bank Refresh.
Massed Refresh: Crammed Refresh + Scattered Refresh. Two banks are refreshed in parallel, and SLP is used in both banks being refreshed.
Refreshing only two banks in parallel is a proof of concept; more than two banks can also be chosen. The idea is to keep a balance between #AB and BLP for refresh.
BLP: bank-level parallelism. SLP: subarray-level parallelism. #AB: number of banks available to serve other requests while the remaining banks are being refreshed.

31 Crammed Refresh
Example: Crammed Refresh operation in an HMC vault with a Refresh Bundle size of 4, with trfc timing compared against Per-bank and Scattered Refresh.
Crammed Refresh uses bank-level parallelism (BLP) for refresh. Only two banks are refreshed in parallel, so #AB > 0.
trfc for Crammed Refresh < trfc for Scattered Refresh
Legend: L = layer ID, B = bank ID, SA = subarray ID, R = row ID.


32 Massed Refresh
Example: Massed Refresh operation in an HMC vault with a Refresh Bundle size of 4, with trfc timing compared against Per-bank and Crammed Refresh.
Massed Refresh uses bank-level parallelism (BLP) together with subarray-level parallelism (SLP) for refresh.
trfc for Massed Refresh < trfc for Crammed Refresh
Legend: L = layer ID, B = bank ID, SA = subarray ID, R = row ID.
How can BLP and SLP be implemented together?
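One way to picture how a Refresh Bundle might be spread over two banks and several subarrays is sketched below. Everything here is an assumption made for illustration: the mapping rule in `massed_schedule`, the timing formulas, and the values of the row cycle time and overlap are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical values, chosen only for illustration (not from the slides).
T_RC_NS = 50.0          # assumed row cycle time (trc)
OVERLAP_NS = 15.0       # assumed PRECHARGE/ACTIVATE overlap enabled by SLP
BANKS_IN_PARALLEL = 2   # two banks refreshed in parallel, as on the slide


def massed_schedule(rb_size, banks_in_parallel=BANKS_IN_PARALLEL):
    """Return a (step, bank, subarray) tuple for each row of a Refresh Bundle."""
    schedule = []
    for row in range(rb_size):
        step = row // banks_in_parallel   # rows in the same step run in parallel
        bank = row % banks_in_parallel    # alternate between the parallel banks
        subarray = step                   # new subarray each step, so steps can overlap
        schedule.append((step, bank, subarray))
    return schedule


def trfc_crammed(rb_size, t_rc, banks_in_parallel=BANKS_IN_PARALLEL):
    # BLP only: each step costs a full row cycle.
    return math.ceil(rb_size / banks_in_parallel) * t_rc


def trfc_massed(rb_size, t_rc, overlap, banks_in_parallel=BANKS_IN_PARALLEL):
    # BLP + SLP: consecutive steps in a bank overlap by `overlap`.
    steps = math.ceil(rb_size / banks_in_parallel)
    return t_rc + (steps - 1) * (t_rc - overlap)


schedule = massed_schedule(4)
crammed = trfc_crammed(4, T_RC_NS)
massed = trfc_massed(4, T_RC_NS, OVERLAP_NS)
print(schedule, crammed, massed)
```

Under these assumptions the bundle of 4 rows needs two steps, and overlapping those steps is what makes the massed figure smaller than the crammed one, in line with the inequality on the slide.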

33 Subarray-level Parallelism (SLP)
Source: Y Kim+, ISCA 2012
A global row-address latch hinders SLP.
[Figure: global row-address latch versus per-subarray row-address latches]

34 Bank-level Parallelism (BLP)
BLP is implemented by masking the BankID during refresh.
[Figure labels: Logic Base (LoB) with vault controller, refresh controller, refresh scheduler, control, physical address decoder, address calculator, physical address latch, and a 17-bit address counter made up of LayerAddr[2], BankAddr[1], and RowAddr[14]; TSV launch pads; memory dies 1 to 4 with LayerID (LID) and BankID (BID) enable (EN) logic, BankID mask, and row address latch feeding the banks]
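The effect of masking the BankID can be sketched with plain bit manipulation. The field widths (2 layer bits, 1 bank bit, 14 row bits) come from the slide; the bit ordering (layer in the most significant bits, row in the least significant) and the function names are assumptions made for this illustration.

```python
# Field widths from the slide: LayerAddr[2], BankAddr[1], RowAddr[14].
LAYER_BITS, BANK_BITS, ROW_BITS = 2, 1, 14
COUNTER_BITS = LAYER_BITS + BANK_BITS + ROW_BITS  # 17-bit address counter


def decode(counter):
    """Split a counter value into (layer, bank, row); bit order is assumed."""
    counter &= (1 << COUNTER_BITS) - 1
    row = counter & ((1 << ROW_BITS) - 1)
    bank = (counter >> ROW_BITS) & ((1 << BANK_BITS) - 1)
    layer = (counter >> (ROW_BITS + BANK_BITS)) & ((1 << LAYER_BITS) - 1)
    return layer, bank, row


def refresh_targets(counter, mask_bank_id):
    """List the (layer, bank, row) targets that respond to a refresh address."""
    layer, bank, row = decode(counter)
    # With the BankID masked, every bank in the layer matches the address,
    # so the same row is refreshed in all of them in parallel.
    banks = range(1 << BANK_BITS) if mask_bank_id else [bank]
    return [(layer, b, row) for b in banks]


example = (2 << 15) | (1 << 14) | 5   # layer 2, bank 1, row 5
unmasked = refresh_targets(example, mask_bank_id=False)
masked = refresh_targets(example, mask_bank_id=True)
print(unmasked, masked)
```

Without the mask a single bank responds; with the mask both banks of the layer respond to the same row address, which is the bank-level parallelism the slide describes.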



36 Evaluation Setup
Trace-driven simulation for PARSEC benchmarks: memory access traces were extracted from detailed cycle-accurate simulations using gem5. These memory traces were then provided as inputs to the DRAM simulator DRAMSim2.
Energy, timing, and area analysis: CACTI-3DD based simulation, based on a 4Gb HMC quad model.
DRAMSim2 configuration: DRAMSim2 was configured using the CACTI-3DD results.


38 Results I: Energy, Timing, Area

39 Results II: Throughput (PARSEC Benchmarks)
Crammed Refresh achieves 7.1% and 2.9% more throughput on average than distributed Per-bank Refresh and Scattered Refresh, respectively.
Massed Refresh achieves 8.4% and 4.3% more throughput on average than distributed Per-bank Refresh and Scattered Refresh, respectively.

40 Results III: Energy Delay Product (EDP) (PARSEC Benchmarks)
Crammed Refresh achieves 6.4% and 2.7% less EDP on average than distributed Per-bank Refresh and Scattered Refresh, respectively.
Massed Refresh achieves 7.5% and 3.9% less EDP on average than distributed Per-bank Refresh and Scattered Refresh, respectively.


42 Conclusions
The proposed Massed Refresh technique exploits bank-level as well as subarray-level parallelism during refresh operations.
The proposed Crammed Refresh and Massed Refresh techniques improve the throughput and energy efficiency of DRAM.
Crammed Refresh improves upon the state of the art: 7.1% and 6.4% improvements in throughput and EDP, respectively, over distributed Per-bank Refresh, and 2.9% and 2.7% improvements in throughput and EDP, respectively, over Scattered Refresh.
Massed Refresh improves upon the state of the art: 8.4% and 7.5% improvements in throughput and EDP, respectively, over distributed Per-bank Refresh, and 4.3% and 3.9% improvements in throughput and EDP, respectively, over Scattered Refresh.

43 Thank You
Questions / Comments?
