Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns


1 Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns
Wen Wen, Lei Zhao, Youtao Zhang, Jun Yang
March 12, 2018

2 Executive Summary
Problems: performance and reliability of write operations
- Large sneak currents and IR drop in crossbar ReRAM
Proposed solutions: speed up the RESET operation based on data patterns
- Profile the number of bitline LRS cells by exploiting the intrinsic in-memory processing capability of crossbar ReRAM
- Use data compression and a row-address-dependent layout to reduce bitline LRS cells
Contributions
- Correlation between RESET latency and the number of LRS cells on selected bitlines
- A novel profiling technique to dynamically track bitline data patterns
Results
- Performance: 20.5% over baseline, 14.2% over state-of-the-art
- Dynamic energy: 15.7% less than baseline, 7.6% less than state-of-the-art

3 Outline
- Background and Motivation
- Design Details
  - Low Overhead Runtime Profiling
  - Reduce Bitline LRS Cells
- Evaluation Methodologies
- Experimental Results
- Conclusion


5 ReRAM Cell
- High Resistance State (HRS, logic 0) and Low Resistance State (LRS, logic 1)
- SET: high to low resistance; RESET: low to high resistance
[Figure: ReRAM cell structure and two resistance states, showing the top metal layer, the metal oxide with oxygen ions and oxygen vacancies, and the bottom metal layer]

6 ReRAM Crossbar
- ReRAM array structures: 1T1R (wordline, sourceline, bitline), 1D1R (diode plus ReRAM cell), and 0T1R (ReRAM cell only)
- Crossbar structures offer the smallest 4F² planar cell size, low fab cost, and better scalability
- Crossbar structures suffer from sneak currents and IR drop
[Figure: ReRAM array structures (1T1R, 1D1R, 0T1R), with the crossbar structures marked]

7 Sneak Currents in Crossbar ReRAM
- Diode selectors help but cannot eliminate sneak currents
- Sneak currents lead to a serious IR drop issue
- They hurt energy efficiency, performance, and write reliability
- The slower RESET operation is the performance bottleneck: SET takes less time than RESET [Xu et al HPCA15, Zhang et al DATE16]
[Figure: crossbar under write bias with line voltages V, 0, and 1/2V, showing the selected cell, the half-selected cells, and the sneak current]

8 How does IR drop affect RESET latency?
$t \times e^{k V_d} = C$
- t: RESET switching time; V_d: voltage drop; C and k are experimentally fitted constants
- RESET latency is highly sensitive to the voltage drop: the two are exponentially and inversely correlated
- A voltage drop of 0.4V results in a 10x RESET latency increase [Govoreanu et al IEDM11]
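As a rough numerical illustration of the relation on this slide, the sketch below derives the constant k from the quoted 10x slowdown per 0.4V and applies it to other voltage losses. It assumes V_d is the voltage across the selected cell, so that IR drop lowers V_d. The names `K_PER_VOLT` and `latency_scale` are illustrative and do not come from the slides.

```python
import math

# Assumption: t * exp(k * V_d) = C, with V_d the voltage across the selected cell.
# The slide quotes a 10x latency increase for a 0.4 V loss, which fixes k.
K_PER_VOLT = math.log(10) / 0.4  # roughly 5.76 per volt


def latency_scale(voltage_loss_v: float) -> float:
    """Factor by which RESET latency grows when V_d falls by voltage_loss_v volts."""
    return math.exp(K_PER_VOLT * voltage_loss_v)


if __name__ == "__main__":
    for loss in (0.0, 0.1, 0.2, 0.4):
        print(f"{loss:.1f} V loss -> {latency_scale(loss):.2f}x latency")
```

Under this model, even a 0.1V loss costs noticeably more than 10% in latency, which is why the deck treats IR drop as the dominant factor for RESET.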

9 Intuitive Thoughts
Facts
1. During RESET, half-selected cells behave as resistive devices.
2. Under the same voltage stress, a half-selected cell in LRS carries a larger sneak current than one in HRS.
3. RESET operations conservatively use the worst-case access latency of all cells in ReRAM arrays.
Observations
1. The number of half-selected cells in LRS affects RESET latency.
2. Dynamically profiling and tracking runtime data patterns may avoid using the worst-case RESET latency for all cells.
TODO List
1. Explore the correlation between RESET latency and the number of bitline LRS cells.
2. Build a runtime profiler to dynamically track bitline data patterns.
3. Reduce the number of LRS cells on bitlines.

10 RESET latency vs. # of LRS cells
- The more LRS cells there are on the bitline, the larger the IR drop caused by the sneak current, and the longer the RESET operation takes.
[Figure: RESET latency (ns) and voltage drop (V) versus bitline LRS cell percentage]

11 RESET latency vs. # of LRS cells
- This impact diminishes as the row gets closer to the write driver.


13 Low Overhead Runtime Profiling (1)
- Each mat in a bitline-sharing-set (Mat0 to Mat63) is profiled through a shared ADC and comparator, producing a 3-bit value
- Update W-Flag = MAX{64 3-bit values}
- The worst-case bitline data pattern within one bitline-sharing-set determines the optimal RESET latency
[Figure: bitline-sharing-set of Mat0 to Mat63, each with wordline decoders (rows a0, a1) and a shared ADC & comparator for profiling; 64B cachelines a0 and a1]

14 Low Overhead Runtime Profiling (1)
- Dot-product operation: with Vread applied to the rows, an LRS cell (Ron) contributes I1 = Vread/Ron and an HRS cell (Roff) contributes I2 = Vread/Roff, so the bitline carries I = I1 + I2
- The bitline current passes through a transmission gate, a mux (with inputs from other columns), and S/H circuits to the shared ADC and shared comparators, which produce a 3-bit counter value
- Update W-Flag = MAX{64 3-bit values}
- The worst-case bitline data pattern within one bitline-sharing-set determines the optimal RESET latency
[Figure: the bitline-sharing-set of Mat0 to Mat63 with the dot-product read path of one mat expanded]

15 Low Overhead Runtime Profiling (2)
- Dot-product operation: I1 = Vread/Ron, I2 = Vread/Roff, I = I1 + I2
- Read path: transmission gate, mux (with inputs from other columns), S/H, shared ADC, shared comparators, 3-bit counter value
- Aggregated currents are first converted into digital counters, which represent LRS cell percentages.
- W-Flag is updated by comparing all counters in one bitline-sharing-set to find the worst-case bitline.
[Figure: mapping from current (I/mA) to the 3-bit value of the LRS cell percentage (LRS Cell/%), with counter levels up to 111 and a safeguarding area between levels]
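The profiling flow on these slides can be sketched in a few lines: sum the per-cell read currents on a bitline, quantize the result to a 3-bit counter, and take the maximum counter as W-Flag. The device values (`V_READ`, `R_ON`, `R_OFF`), the linear current-to-percentage mapping, and the function names are assumptions made for illustration; the slides do not give them, and the real circuit uses an ADC with a safeguarding margin rather than this ideal arithmetic.

```python
import math

# Hypothetical device values for illustration only (not from the slides).
V_READ = 0.5   # read voltage, volts
R_ON = 10e3    # LRS resistance, ohms
R_OFF = 1e6    # HRS resistance, ohms


def bitline_current(cells):
    """Aggregate bitline current when every row is driven with V_READ.

    cells is a sequence of 0/1 values, where 1 means the cell is in LRS.
    """
    return sum(V_READ / (R_ON if c else R_OFF) for c in cells)


def lrs_counter(current, n_cells, levels=8):
    """Quantize a bitline current to a counter in 0..levels-1.

    Counter k stands for an LRS share in [k/levels, (k+1)/levels).
    """
    i_min = n_cells * V_READ / R_OFF  # current with all cells in HRS
    i_max = n_cells * V_READ / R_ON   # current with all cells in LRS
    frac = min(max((current - i_min) / (i_max - i_min), 0.0), 1.0)
    # The small epsilon guards against rounding just below a range boundary.
    return min(levels - 1, math.floor(frac * levels + 1e-9))


def w_flag(counters):
    """W-Flag is the largest counter in the bitline-sharing-set."""
    return max(counters)


if __name__ == "__main__":
    bitline = [1] * 200 + [0] * 312  # 200 of 512 cells in LRS
    print(lrs_counter(bitline_current(bitline), len(bitline)))
```

Because the summed current grows linearly with the number of LRS cells in this idealized model, a single analog read yields the count that would otherwise need 512 digital reads.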

16 Row Address Impact
- So far we have covered how to profile bitline data patterns, but we have not yet exploited the impact of the row address on RESET latency.
- Rows with different addresses are mapped to 8 groups with different worst-case RESET latencies.
[Figure: rows between the wordline decoders and the write drivers & SA divided into Row Address Group #0 (rows 0-63) through Row Address Group #7; RESET latency decreases toward the write drivers & SA]
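A minimal sketch of the row-to-group mapping follows. The slide only confirms that there are 8 groups and that group #0 covers rows 0-63; the assumptions here are that a mat has 512 rows and that all groups have the same size. `row_group` is an illustrative name.

```python
ROWS_PER_MAT = 512  # assumption: square mat with as many rows as the 512 bitlines
NUM_GROUPS = 8      # from the slide
ROWS_PER_GROUP = ROWS_PER_MAT // NUM_GROUPS  # 64, consistent with group #0 = rows 0-63


def row_group(row: int) -> int:
    """Return the row address group (0..7) for a row index within a mat."""
    if not 0 <= row < ROWS_PER_MAT:
        raise ValueError(f"row {row} is outside the mat")
    return row // ROWS_PER_GROUP


if __name__ == "__main__":
    for row in (0, 63, 64, 511):
        print(row, "->", row_group(row))
```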

17 A Summary of the Profiling Technique
Finding the worst-case bitline: 3-bit W-Flag
- Records the percentage of LRS cells in the worst-case bitline
- Detected periodically in each mat
Tracking the worst case: 6-bit W-Cnt
- W-Cnt is cleared when W-Flag is updated
- W-Cnt is incremented on each write
- A W-Cnt overflow triggers an increment of W-Flag
RESET latency optimization
- W-Flag, W-Cnt, and the row address are used to determine tWR for RESET
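The W-Flag and W-Cnt rules above amount to a small state machine, sketched below. The class name, the choice to start at the most conservative W-Flag, and the saturation of W-Flag at 7 are assumptions; the slide only states the bit widths and the clear, increment, and overflow rules.

```python
class WorstCaseTracker:
    """Tracks the worst-case bitline state of one bitline-sharing-set."""

    W_FLAG_MAX = 7    # largest value of a 3-bit W-Flag
    W_CNT_LIMIT = 64  # a 6-bit W-Cnt overflows on the 64th increment

    def __init__(self):
        # Assumption: before the first profiling pass, use the most conservative level.
        self.w_flag = self.W_FLAG_MAX
        self.w_cnt = 0

    def profile(self, counters):
        """Periodic profiling: W-Flag becomes the maximum 3-bit counter, W-Cnt is cleared."""
        self.w_flag = max(counters)
        self.w_cnt = 0

    def on_write(self):
        """Count a write; an overflow bumps W-Flag (saturating) and clears W-Cnt."""
        self.w_cnt += 1
        if self.w_cnt == self.W_CNT_LIMIT:
            self.w_flag = min(self.W_FLAG_MAX, self.w_flag + 1)
            self.w_cnt = 0


if __name__ == "__main__":
    tracker = WorstCaseTracker()
    tracker.profile([3, 5, 2])
    for _ in range(64):
        tracker.on_write()
    print(tracker.w_flag, tracker.w_cnt)
```

Bumping W-Flag on overflow keeps the stored level a safe upper bound between profiling passes without re-reading the array on every write.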

Determine RESET Timing (ns)
- RESET latency depends on bitline data patterns (W-Flag, W-Cnt) and row address
[Table: RESET latency by LRS cell ratio (rows) and row address group 0 to 7 (columns); surviving entries for LRS cell ratio 111: 202.4, 197.]

18 Determine RESET Timing (ns)
- RESET latency depends on bitline data patterns (W-Flag, W-Cnt) and row address
- RESET latency conservatively uses the upper-limit number, which is the worst case in the next LRS cell ratio range
[Table: RESET latency by LRS cell ratio (rows) and row address group (columns)]

19 Determine RESET Timing (ns)
- RESET latency depends on bitline data patterns (W-Flag, W-Cnt) and row address
- Example callout: LRS: %~75%, but not exceeding % before W-Cnt overflows
- RESET latency conservatively uses the upper-limit number, which is the worst case in the next LRS cell ratio range
[Table: RESET latency by LRS cell ratio (rows) and row address group (columns); surviving entries: 142, 140.9]


21 Reduce Bitline LRS Cells
Naïve data compression cannot help since:
- The RESET latency depends on the worst case of all 512 bitlines
- Naïve data compression may not help the worst-case bitline
Row-address biased data layout
- Evenly distribute the extra 0s after compression
- Apply n-bit shifting (0b, 1b, 2b, 3b, 4b, 5b, 6b, 7b)
[Figure: original data before compression; compressed data before shifting; compressed data after shifting; compressed data layout after shifting in ReRAM mat bitline-sharing-sets]
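The toy example below illustrates why shifting matters after compression. It is only a sketch under assumptions: compressed rows are modeled as bit lists with the freed zeros at one end, and each row is rotated by its own row index, which is a stand-in for the slide's n-bit shifting rather than the exact scheme. The function names are hypothetical.

```python
def rotate_right(bits, n):
    """Rotate a list of bits to the right by n positions."""
    n %= len(bits)
    return bits[-n:] + bits[:-n] if n else list(bits)


def column_lrs_counts(rows):
    """Number of 1s (LRS cells) on each bitline, i.e. each column."""
    return [sum(col) for col in zip(*rows)]


def biased_layout(rows):
    """Assumed layout rule: rotate row r by r bits so the freed zeros spread out."""
    return [rotate_right(row, r) for r, row in enumerate(rows)]


if __name__ == "__main__":
    # Eight compressed rows: payload in the first half, freed zeros in the second.
    rows = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(8)]
    print(max(column_lrs_counts(rows)))                 # worst bitline without shifting
    print(max(column_lrs_counts(biased_layout(rows))))  # worst bitline with shifting
```

Without shifting, compression leaves the worst bitline unchanged because the zeros pile up on the same columns; spreading them is what lowers the maximum that sets the RESET latency.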


23 Experimental Methodologies
Chip-multiprocessor system simulator
- CPU, multi-level cache, and ReRAM memory
- ReRAM performance/energy numbers from HSPICE and NVSim
- Memory: 8GB, 1 channel, 2 ranks, 8 chips/rank, 2Gb x8 ReRAM chip, 8 banks/chip, 1024 mats/bank
- ReRAM timing: Read: 18ns @ 1.5V; SET: 10ns @ 3V; RESET: dynamically determined @ -3V, 88 μA
Workloads
- SPEC2006, BioBench, and PARSEC
- WPKI/RPKI: High, Medium, Low
Compared schemes
- BL: Conventional ReRAM crossbar design with DSGB
- RA: The state-of-the-art row address awareness technique [Zhang et al DATE16]
- LRS: Naïve data pattern profiling technique
- CMP: LRS + data compression + row-address biased shifting technique
- ALL: Proposed design with all enhancements


25 Memory Write Latency
[Figure: normalized memory write latency for BL, RA, LRS, CMP, and ALL on ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped as High, Medium, and Low; annotated values: 63%, 54%]

26 Memory Read Latency
[Figure: normalized memory read latency for BL, RA, LRS, CMP, and ALL on ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped as High, Medium, and Low; annotated value: 38%]

27 System Performance
- Larger performance improvements on benchmarks with high memory intensity
[Figure: normalized cycles per instruction (CPI) comparison, lower is better, for BL, RA, LRS, CMP, and ALL on ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped as High, Medium, and Low; annotated value: 21%]

28 Memory Dynamic Energy and EDP
- 15.7% and 7.6% dynamic energy reduction, and 31.9% and 19.5% EDP improvements over baseline and state-of-the-art, respectively.
[Figure: normalized dynamic energy (split into write, read, and profiling) and normalized EDP for BL, RA, LRS, CMP, and ALL on ferret, fasta, gems, zeus, gcc, cactus, perl, freq, gobmk, fluid, and the mean, grouped as High, Medium, and Low]
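Since EDP is the product of energy and delay, the energy and EDP figures on this slide imply a delay ratio, which the short calculation below recovers. It assumes the quoted percentages are reductions relative to the same reference design; the slide does not say which delay metric enters the EDP, so the result is only a consistency check. `implied_delay_ratio` is an illustrative name.

```python
def implied_delay_ratio(energy_reduction: float, edp_reduction: float) -> float:
    """Delay ratio implied by EDP = energy * delay, given fractional reductions."""
    return (1.0 - edp_reduction) / (1.0 - energy_reduction)


if __name__ == "__main__":
    vs_baseline = implied_delay_ratio(0.157, 0.319)
    vs_state_of_the_art = implied_delay_ratio(0.076, 0.195)
    print(f"implied delay vs. baseline: {vs_baseline:.3f}")
    print(f"implied delay vs. state-of-the-art: {vs_state_of_the_art:.3f}")
```

The implied delay reductions (roughly 19% and 13%) are in the same range as the performance gains reported in the summary (20.5% and 14.2%).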


30 Conclusion
Problems: performance and reliability of write operations
- Large sneak currents and IR drop in crossbar ReRAM
Proposed solutions: speed up the RESET operation based on data patterns
- Profile the number of bitline LRS cells by exploiting the intrinsic in-memory processing capability of crossbar ReRAM
- Use data compression and a row-address-dependent layout to reduce bitline LRS cells
Contributions
- Correlation between RESET latency and the number of LRS cells on selected bitlines
- A novel profiling technique to dynamically track bitline data patterns
Results
- Performance: 20.5% over baseline, 14.2% over state-of-the-art
- Dynamic energy: 15.7% less than baseline, 7.6% less than state-of-the-art

31 Thank you!


33 Backup Slides

34 Prior Art
Performance of RESET operation
- Double-sided ground biasing and multi-phase RESET [Xu et al HPCA15]
- Row address dependent logical regions [Zhang et al DATE16]
- No studies on the impact of bitline data patterns
Current accumulation feature of ReRAM crossbar
- Using ReRAM crossbar to implement dot-product analog calculations [Bojnordi et al HPCA16, Chi et al ISCA16, Shafiee et al ISCA16, Song et al HPCA17]
- Few studies on exploiting this feature to accelerate memory access
Data pattern in ReRAM crossbar
- Correlation between voltage drop and data pattern in ReRAM crossbar [Liang et al TED2010]
- Correlation between read latency and bitline data pattern [Zhang et al JETCAS15]
- Correlation between RESET latency and the number of RESET bits [Xu et al HPCA15]
- No observations on the correlation between RESET latency and bitline data pattern

35 RESET latency vs. # of LRS cells
- Build and simulate a crossbar ReRAM HSPICE model
- Key parameters are summarized below

Metric   Description                               Value
A        Mat size: A wordlines x A bitlines
n        Number of bits to read/write              8
Rwire    Wire resistance between adjacent cells    2.82 Ω
Kr       Nonlinearity of the selector              200
Vw       Full selected voltage during write        3V
-        Voltage biasing scheme                    DSGB

36 Overhead Analysis
- Profiling power consumption and area are estimated with HSPICE and NVSim at 32nm
- Profiling overhead is small: 3.7x read energy, negligible area overhead
- Counter storage and RESET adjustment: 3-bit W-Flag and 6-bit W-Cnt for each bitline-sharing-set; 288KB storage overhead of all flags for an 8GB memory
- Profiling overhead of one bank is summarized below

Comp.         Params (Spec.)                                          Power/Energy                           Area (mm²)
ADC           Sampling speed 1.28GS/s; resolution 8-bit; number 8     24.48mW
S+H           Number                                                  uW
ReRAM array   Mat number; mat size                                    Profiling: pJ; Read: pJ; Leakage: mW
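The 288KB figure can be reproduced with a short calculation, sketched below. The key assumption, which the slides do not state outright, is that one bitline-sharing-set stores one 64B cacheline per row across 512 rows, so it covers 32KB of data. The constant and function names are illustrative.

```python
MEMORY_BYTES = 8 * 2**30  # 8GB main memory, from the slides
CACHELINE_BYTES = 64      # one cacheline per row of a bitline-sharing-set, from the slides
ROWS_PER_SET = 512        # assumption: 512 rows per mat
FLAG_BITS_PER_SET = 3 + 6 # 3-bit W-Flag plus 6-bit W-Cnt, from the slides


def flag_storage_bytes() -> int:
    """Total storage for all W-Flag and W-Cnt fields under the assumptions above."""
    bytes_per_set = CACHELINE_BYTES * ROWS_PER_SET  # 32KB of data per set
    num_sets = MEMORY_BYTES // bytes_per_set
    return num_sets * FLAG_BITS_PER_SET // 8


if __name__ == "__main__":
    print(flag_storage_bytes() // 1024, "KB")
```

Under these assumptions the metadata is 9 bits per 32KB of data, which matches the slide's total.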

37 Sensitivity Study
Sensitivity to # of ADC units (8 ADC units vs. 16 ADC units)
- Trivial improvement from doubling the number of ADC units
- Only 1.1% performance improvement observed
Sensitivity to mat size
- The proposed scheme is slightly worse (only 1.6%) than RA for the 256 x 256 mat size
- Profiling overhead is independent of mat size and has a larger impact on smaller mats
[Figure: normalized CPI and normalized dynamic energy for BL, RA, LRS, CMP, and ALL in both sensitivity studies; annotated value: 14%]
