DRISA: A DRAM-based Reconfigurable In-Situ Accelerator


1 DRISA: A DRAM-based Reconfigurable In-Situ Accelerator
Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, Yuan Xie
University of California, Santa Barbara; Memory Solutions Lab, Samsung Semiconductor Inc.
Scalable and Energy-efficient Architecture Lab (SEAL), SEAL@UCSB

6 Motivation and Observation
Merging the computing resources and memory fabrics:
Memory-rich processor (DaDianNao and ShiDianNao ASICs, TITAN X GPU): low memory capacity.
Compute-capable memory, PIM (BufferedComp, NeuroCube): low performance.
To have BOTH: (1) use DRAM technology; (2) remove system-memory constraints. Building an accelerator with DRAM technology.
[Figure: normalized on-chip memory capacity per area (1E+00 to 1E+03) versus normalized peak performance per area (1E+00 to 1E+04), marking the memory-rich processor region, the compute-capable memory (PIM) region, and this work.]

12 Key Ideas and Approaches
To have BOTH: (1) use DRAM technology; (2) remove system-memory constraints. Building an accelerator with DRAM technology.
DRAM technology: logic-incompatible, so only simple Boolean logic operations.
General purpose: reconfigurable.
High performance: improve parallelism, unblock data movement, optimize activation.
Diagram labels: cells, bitline, NOR, SHIFT, multi-subarray active, multi-bank active.

19 Architecture Overview
[Figure panels: (a) chip, with groups of banks; (b) bank, with subarrays, mats, and bctrl; (c) subarray and mat, with sctrl, DRAM cells that support Boolean logic operations, and a shifter.]
DRAM modifications:
Change decoders to controllers.
Change to support logic operations.
Add shifters.
Others: group/bank buffers help internal data transfer, bank/subarray reorganization, split cell array regions.

23 Make BL Be Able To Compute (1/2)
Three solutions:
3T1C: natural NOR on the BL.
1T1C: add gates, or adopt Ambit's methods.
1T1C-adder: add full adders to the BL.
[Figure: three bitline circuits. 3T1C-NOR: rows Rs, Rt, Rr with rwl, wwl, rbl, and wbl lines. 1T1C-NOR/MIX: rows Rs, Rt, Rr with "and", "or", and "Pre-load" labels, a logic gate, and a latch. 1T1C-ADDER: rows Rs, Rt, Rr with latches and an n-bit adder.]

34 Make BL Be Able To Compute (2/2)
Example: selector R = (S == 1) ? X : Y
Boolean form: R = S X + !S Y
NOR-only logic: R = NOR(0, NOR(NOR(!S, !X), NOR(S, !Y)))
Step-1: !X = NOR(0, X)
Step-2: !Y = NOR(0, Y)
Step-3: !S = NOR(0, S)
Step-4: tmp1 = NOR(!S, !X) = !(!X + !S)
Step-5: tmp2 = NOR(S, !Y) = !(!Y + S)
Step-6: !R = NOR(tmp1, tmp2)
Step-7: R = NOR(0, !R)
Rows on the bitline after the steps: X, Y, S, !X, !Y, !S, !(!X+!S), !(!Y+S), !R, R.
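
The seven NOR steps of the selector example can be checked with a small software sketch. It is only a behavioral model, not the DRAM circuit: each Python integer stands for one row, each bit of it for one bitline, and the helper names `nor` and `select_nor_only` are hypothetical.

```python
from itertools import product


def nor(a: int, b: int, width: int = 1) -> int:
    """Bitwise NOR of two rows, truncated to `width` bitlines."""
    mask = (1 << width) - 1
    return ~(a | b) & mask


def select_nor_only(s: int, x: int, y: int, width: int = 1) -> int:
    """Return X on bitlines where S is 1 and Y elsewhere, using only NOR."""
    not_x = nor(0, x, width)          # Step 1: !X
    not_y = nor(0, y, width)          # Step 2: !Y
    not_s = nor(0, s, width)          # Step 3: !S
    tmp1 = nor(not_s, not_x, width)   # Step 4: !(!X + !S), i.e. S AND X
    tmp2 = nor(s, not_y, width)       # Step 5: !(!Y + S), i.e. !S AND Y
    not_r = nor(tmp1, tmp2, width)    # Step 6: !R
    return nor(0, not_r, width)       # Step 7: R


# Exhaustive check on a single bitline.
for s, x, y in product((0, 1), repeat=3):
    assert select_nor_only(s, x, y) == (x if s == 1 else y)

# Eight bitlines computed in parallel, one selector per bitline.
r = select_nor_only(0b11110000, 0b10101010, 0b01010101, width=8)
print(f"{r:08b}")  # 10100101
```

Every step touches all bitlines of a row at once, which is where the parallelism of bitline computing comes from.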

39 Shifters (1/2)
Why include shifters: e.g., carry-in propagation.
[Figure labels: cells, bitline, NOR, SHIFT; operand bits X1, Y1, X0, Y0; Cin0, S0, Cout0, Cin1.]
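
A minimal sketch of why carry propagation needs a shift, assuming each bit position of a Python integer stands for one bitline. Bitwise operations work on every bitline independently, so the carry-out of column i can only reach column i+1 through an explicit shift. The function name `add_with_shift` is hypothetical, and Python's `&` and `^` stand in for the NOR sequences a real design would use.

```python
def add_with_shift(x: int, y: int, width: int = 8) -> int:
    """Add two `width`-bit values using only per-bitline logic plus a shift."""
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    while y:
        carry = x & y            # per-bitline carry-out
        x = x ^ y                # per-bitline sum, ignoring carry-in
        y = (carry << 1) & mask  # SHIFT: Cout of column i becomes Cin of column i+1
    return x                     # result is modulo 2**width


print(add_with_shift(13, 7))  # 20
```

Without the shift in the loop, the carries would stay in their own columns and the result would be wrong whenever two 1 bits meet.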

43 Shifters (2/2)
Multiple hierarchies:
Intra-lane: bit shift inside an 8-bit lane.
Inter-lane: array element shift.
Forwarding: access any element in the array.
[Figure: bitlines grouped into virtual lanes (INT8).]
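
The three shifter levels can be illustrated with a toy model in which a row is a list of 8-bit lane values. This is a sketch under assumptions: the function names, the zero fill, and the shift directions are hypothetical choices, not details taken from the slides.

```python
LANE_BITS = 8  # one virtual lane holds an INT8 element


def intra_lane_shift(row: list[int], k: int) -> list[int]:
    """Shift every lane left by k bits; bits never cross a lane boundary."""
    mask = (1 << LANE_BITS) - 1
    return [(value << k) & mask for value in row]


def inter_lane_shift(row: list[int], n: int) -> list[int]:
    """Move whole elements n lanes toward higher indices, filling with zeros."""
    n = max(0, min(n, len(row)))
    return [0] * n + row[: len(row) - n]


def forward(row: list[int], src: int, dst: int) -> list[int]:
    """Copy the element in lane `src` into lane `dst`, whatever their distance."""
    out = list(row)
    out[dst] = row[src]
    return out


row = [0x01, 0x81, 0x10, 0xFF]
print(intra_lane_shift(row, 1))  # [2, 2, 32, 254]
print(inter_lane_shift(row, 1))  # [0, 1, 129, 16]
print(forward(row, 3, 0))        # [255, 129, 16, 255]
```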

45 Putting Compute-capable BLs and Shifters Together
[Figure: cycles versus operand bit length for CSA and FA; cycles versus operand-1 bit length with operand-2 bit length = 2 bits.]
Observations:
CSA is preferred: reduction works fine.
Affordable MUL: one operand needs to be within 2 bits.
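
Assuming the adder preferred in the observations above is a carry-save adder (CSA), the following sketch shows why it suits reductions: a CSA turns three operands into two (a sum word and a carry word) with per-bitline logic and a single shift, so no carry has to ripple until the very last addition. The helper names are hypothetical, and operands are non-negative Python integers.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save step: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c                            # per-bit sum
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted once
    return sum_word, carry_word


def reduce_sum(values: list[int]) -> int:
    """Sum many operands with CSA steps and one final carry-propagating add."""
    operands = list(values)
    while len(operands) > 2:
        a, b, c = operands.pop(), operands.pop(), operands.pop()
        operands.extend(csa(a, b, c))  # three operands in, two out
    return sum(operands)               # the only full addition


print(csa(1, 1, 1))                 # (1, 2)
print(reduce_sum([3, 5, 7, 9, 11]))  # 35
```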

51 Optimizations for high performance
DRAM technology: logic-incompatible, so simple Boolean logic, run serially.
General purpose: reconfigurable.
High performance: improve parallelism, unblock data movement, optimize activation.
Adopting commodity DRAM: 13 cycles for an 8-bit CSA; cycle time tRC (46 ns).
[Figure: normalized on-chip memory capacity per area (1E+00 to 1E+03) versus normalized peak performance per area (1E+00 to 1E+04), showing the un-optimized design, the target, and the compute-capable memory (PIM) and memory-rich processor regions.]
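
A back-of-the-envelope sketch of the un-optimized cost, assuming each of the 13 cycles takes one full row cycle tRC of 46 ns. The slide gives both numbers but not their exact relation, so the product is an estimate, and the lane count in the usage line is purely hypothetical.

```python
T_RC_NS = 46         # row cycle time quoted on the slide
CYCLES_PER_ADD = 13  # cycles quoted for one 8-bit addition


def add_latency_ns(cycles: int = CYCLES_PER_ADD, t_rc_ns: float = T_RC_NS) -> float:
    """Latency of one serial addition if every cycle costs one tRC."""
    return cycles * t_rc_ns


def adds_per_second(parallel_lanes: int, latency_ns: float) -> float:
    """Throughput when `parallel_lanes` additions run side by side."""
    return parallel_lanes / (latency_ns * 1e-9)


latency = add_latency_ns()
print(latency)  # 598
# Hypothetical lane count, only to show how parallelism offsets the long latency.
print(f"{adds_per_second(65536, latency):.3e}")
```

The long per-operation latency is why the deck lists parallelism, data movement, and activation as the optimization targets.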

52 Experiment Setup
DRISA circuit simulator:
Heavily modified CACTI.
Digital circuits (controller, logic gates): from Design Compiler synthesis, scaled to the DRAM process with 20% performance overhead and 80% area overhead (ISCAS 99).
DRISA performance simulator:
A behavior-level simulator.
Includes a mapping optimization framework.
[Figure: simulation flow. Labels: NN topology, mapping scheme, design options (number of mats/subarrays/banks), device parameters, performance simulator (in-house), latency/cycles, circuit simulator (Design Compiler + CACTI-3DD), circuits, power/ops, speed, power, area, leakage.]

60 Case study: binary weight, 8-bit activation CNN inference
[Figure: performance per area (frames/s/mm2, log scale from 1E-02 to 1E+02) for AlexNet, VGG-16, VGG-19, ResNet-152, and GM, comparing 3T1C, 1T1C-nor, 1T1C-mixed, 1T1C-adder, and GPU-INT.]
3T1C is not good: the lowest area overhead, but large memory cells.
1T1C-adder is not the best: the best peak performance, but low effective performance.
1T1C-mixed is the best solution.
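
A minimal sketch of why binary weights fit an accelerator that has cheap additions but expensive multiplications. It assumes the common encoding in which a weight bit of 1 means +1 and 0 means -1. The slides only say "binary weight", so this encoding and the function name `binary_weight_dot` are assumptions.

```python
def binary_weight_dot(activations: list[int], weight_bits: list[int]) -> int:
    """Dot product with +1/-1 weights: only selection and add/subtract, no multiply."""
    acc = 0
    for activation, bit in zip(activations, weight_bits):
        acc += activation if bit else -activation  # a selector picks +a or -a
    return acc


activations = [10, 20, 30]  # 8-bit activations
weight_bits = [1, 0, 1]     # encodes weights +1, -1, +1
print(binary_weight_dot(activations, weight_bits))  # 20
```

Each term needs a selection and an accumulation, which are the kinds of operations the bitline logic and adders in the earlier slides provide.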

61 More in the paper
Microarchitectures of BL-logic operations and shifter
Interface design
Optimizations for high performance
Impact of variation
CNN mapping and optimizations
Detailed experiment setup and more results

62 Summary
In-situ computing: building an accelerator with DRAM technology.
DRAM for large memory capacity.
BL-computing logic design + shifter for general-purpose instructions.
Optimized for high computing performance.
Experiments on binary CNN acceleration:
Performance per area: 8.8x that of the ASIC, 7.7x that of the GPU.
Energy efficiency per area: 1.2x that of the ASIC, 15x that of the GPU.
Diagram labels: cells, bitline, NOR, SHIFT, multi-subarray active, multi-bank active.

63 Questions?
DRISA: A DRAM-based Reconfigurable In-Situ Accelerator
