SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator

Size: px

Start display at page:

Download "SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator"

Maud Oliver
5 years ago
Views:

1 SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator Shuangchen Li, Alvin Oliver Glova, Xing Hu, Peng Gu, Dimin Niu*, Krishna T. Malladi*, Hongzhong Zheng*, Bob Brennan*, and Yuan Xie University of California, Santa Barbara *Memory Solutions Lab, Samsung Semiconductor Inc.

2 Executive Summary Introducing Stochastic Computing to In-DRAM Computing Architecture: For solving slow multiplication (MUL), Stochastic Leverage large capacity and bandwidth. Computing DRAM-based In-situ Acc. MUL AND Three arithmetic techniques (H 2 D) to improve stochastic computing on such architecture. In-DRAM comp. H 2 D Arithmetic Experiments shows 2.3x performance improvement v.s. w/o stochastic computing. 2

.. DRAM-based In-situ Accelerator rocks: with computing engines at

3 ... On-chip memory capacity DRAM in-situ Accelerator Rocks Cells Cells Cells Bitline... DRAM-based In-situ Accelerator rocks: with computing engines at BL-level, tightly bound memory and computing. NOR SA Compute-Capable 2D/3D Memory[neurocube] DRAM Memory-rich Processor[dadiannao] Performance 3

4 ... Cells Cells Cells DRAM in-situ Accelerator Rocks Challenges Bitline NOR SA... DRAM Run Boolean logic operation in SERIAL. For example: R = S X + ሚS Y NOR-only logic R = NOR( NOR( ሚS, X), NOR(S, Y) ) Step-1: X = NOR(0, X) Step-2: Y = NOR(0, Y) Step-3: ሚS = NOR(0, S) Step-4: tmp1 = NOR( ሚS, X) Step-5: tmp2 = NOR(S, Y) Step-6: R = NOR(tmp1,tmp2) Step-7: R = NOR(0, R) X Y S!X!Y!S!(!X+!S)!(!Y+S)!R R 4

5 ... Cycles Cells Cells Cells DRAM in-situ Accelerator Rocks Challenges Bitline NOR SA... DRAM Run Boolean logic operation in SERIAL. For example: R = S X + ሚS Y DRAM-Acc s Big Challenge: Very Slow Multiplications! NOR-only logic X Y S 143 cycles!x for 8-bit MUL! 1.0E+3 1.0E+2 1.0E+1 MUL SC R = NOR(0, R)!Y!S!(!X+!S)!(!Y+S) 1.0E+0!R R = NOR( NOR( ሚS, X), NOR(S, Y) ) Step-7: Binary operand bit length R 4

6 The Opportunity: Stochastic Computing Stochastic Computing (SC): A different data representation and arithmetic (like INT vs. FP) Bitstream representation: value = possibility of appearance of 1 # 1 s = 3 #bits = 6 # 1 s = 2 #bits = 6 5

7 The Opportunity: Stochastic Computing Stochastic Computing (SC): A different data representation and arithmetic (like INT vs. FP) Bitstream representation: value = possibility of appearance of 1 Binary Long SC bitstream & & & & & & MUL Simple bitwise AND! 5

8 Thr/Area (Muls/ns/nm 2 ) Thr/W (Muls/W) 2.94E E+12 The Good: But not a Free Lunch MUL AND, reducing MUL latency by 47x. The Bad: Hurt throughput & energy efficiency Exp-long bitstream (8 256), intensive BW and Capacity demands Stoc-and-Binary conversion overhead The Ugly: The Ugly: Numerical precision loss Unreproducible error, no debug 1.E+4 1.E+3 1.E+2 1.E+1 1.E+0 (a). Stoch. Bin 1.E+13 1.E+12 1.E+11 1.E+10 (b). SC-MUL area AND 1% PC 4% SN G 95% Stoch. Bin 6

9 Cycles Key Idea: Combining DRAM-Acc with SC Stochastic 1.0E+3 1.0E+2 1.0E+1 MUL SC Computing MUL AND 1.0E Binary operand bit length DRAM-based In-situ Acc. 7

10 Key Idea: Combining DRAM-Acc with SC Stochastic Computing MUL AND BIN(5) = DRAM-based In-situ Acc. SC(5) =

11 Key Idea: Combining DRAM-Acc with SC Stochastic Computing MUL AND DRAM-based In-situ Acc. In-DRAM comp. H 2 D Arithmetic 7

12 Related Work BL-level in memory computing architecture is hot. AMBIT[MICRO 17], DIRSA[MICRO 17],Compute Caches[HPCA 17] Stochastic computing is well study since 1960s. Showing promising results on DNN workloads J. Dickson[ICNN 93], K.Kim[DAC 16], DSCNN[ASPLOS 16], DPS[DAC 18], S.K. Khatamifard[CAL 18] This is the first work combines them together. Putting together, they synergistically reinforce the strengths and address the weaknesses of each other. 8

13 controller Decoder Improved Controller Stoch. Num. Generator (SNG) Overview of the Architecture Subarray Data Cell Region BL... Bank Subarray Comping Cell Region 2 Computational SA 3 BL logic 4 Register Bank SNG (optional) Popcount (optional) 5 Simple Shifters (b) Bank (c) Subarray Building upon BL-level in-dram computing architecture. Adding Stochastic number generator (SNG) and popcount (PC), and Improving controller and BL logic design 9

14 MuLs/Cycle/area Can we do better: Introducing H 2 D Arithmetic MUL AND In-DRAM comp. H 2 D Arithmetic 1.0E+3 1.0E+1 1.0E-1 1.0E-3 1.0E-5 1.0E-7 Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. Only 1.5x throughput! MUL Binary operand bit length SC 10

15 H 2 D Arithmetic-1: Hierarchical Representation Observation: trick for O(2 n ), change 2 n to (2 n/2 + 2 n/2 ). Converting MSB-part and LSB-part separately! Example: Binary: (n width) SC bitstream (2 n -1 width) Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. [1,0,0,1] [1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,1] Binary: [MSBs,LSBs] [1,0,0,1] SC bitstream: [1,0,1],[0,1,0] ( (2 n/2-1)*2 width) 11

16 H 2 D Arithmetic-2: Hybrid Binary-Stochastic Observation: Unnecessarily redundant representation, DRISA is fast at 2-bit MUL. Encoding part of SC as BIN, hybrid SC-BIN! SC bitstream: sub-stream: Count #-of- 1 in each substream: Hybrid-SC: >intra sub-stream is BIN >whole stream is still SC [1,1,1,0,1,0,1,0,1,0,0,0,0,0,1] [ 3, 1, 2, 0, 1 ] [ 11, 01, 10, 00, 01 ] Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. 2 n -1 width (2 n -1)/3*2 width 12

17 H 2 D Arithmetic-3: Deterministic SNG Observation: Randomness is used to ensure low correlation, We only do one OP in SC domain anyway, Real random bitstream is unnecessary. Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error. LUT-based Stochastic Number Generator: offset 1 s 0 s Operand X (5/16): Operand Y (5/16): [ 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0 ] [ 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0 ] periodically 13

18 Norm. Error RMS 0% -33% 275% -23% -25% -60% H 2 D Arithmetic: Putting Them Together Exp-long bitstream. SNG overhead. Precision loss. Unreproducible error H 2 D further increases the throughput by 6x H 2 D improves precision by 60% baseline Hybrid Hier. Deter. w.o.apc H2D-w.PC 14

19 Norm. Perf./Area Experiments: A Case Study on DNN 10 SCOPE-vanilla SCOPE-hyb SCOPE-hier SCOPE-H2D GPU-INT8 (All Norm. to DRISA) vgg resnet rnn Alex-train gmean SCOPE w/ H 2 D: 2.3x than DRISA, 11.6x than w/o H 2 D 15

20 More In the Paper Architecture design detail. H2D design detail. DNN case study detail. More experiments. Come to our poster! 16

21 Summary Stochastic Computing Stochastic Computing each other. In-DRAM Computing Architecture: Synergistically reinforce the strengths and address the weaknesses of MUL AND In-situ Acc. In-DRAM comp. H 2 D Arithmetic Three DRAM-based arithmetic techniques (H 2 D) to further improve performance, and solve precision problems. 17

22 Thanks! Questions SCOPE: A Stochastic Computing Engine for DRAM-based In-situ Accelerator Shuangchen Li, Alvin Oliver Glova, Xing Hu, Peng Gu, Dimin Niu*, Krishna T. Malladi*, Hongzhong Zheng*, Bob Brennan*, and Yuan Xie University of California, Santa Barbara *Memory Solutions Lab, Samsung Semiconductor Inc.

DRISA: A DRAM-based Reconfigurable In-Situ Accelerator

DRISA: A DRAM-based Reconfigurable In-Situ Accelerator DRI: A DRAM-based Reconfigurable In-Situ Accelerator Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, Yuan Xie University of California, Santa Barbara Memory Solutions Lab, Samsung