Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

1 Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das. M-Bits Research Group.

2 Can we transform the CPU into a neural accelerator? (Diagram: a CPU with its caches, alongside a GPU.)

3 Can we transform the CPU into a neural accelerator? Neural Cache repurposes the CPU's caches as the accelerator: ++ parallelism, -- data movement, relative to a GPU.

4 Transforming caches into massively parallel vector ALUs. Baseline: an 18-core Xeon processor with a 45 MB LLC organized as 18 LLC slices.

5 Transforming caches into massively parallel vector ALUs. Each 2.5 MB LLC slice (with its TMU and CBOX) contains Way 1 through Way 20; a way is built from 32 kB data banks, each holding 8 kB arrays. Running totals: 18 LLC slices, 360 ways.

6 Transforming caches into massively parallel vector ALUs. Zooming into one 8 kB SRAM array: wordlines (WL), bitlines (BL/BLB), and a row decoder. Running totals: 18 LLC slices, 360 ways, 5760 arrays.

7 Transforming caches into massively parallel vector ALUs. Operands are stored as bit-slices (Bit-Slice 0 through Bit-Slice 3) of Array A and Array B on the same bitlines; logic at the bottom of the bitlines computes A + B. Running totals: 18 LLC slices, 360 ways, 5760 arrays.

8 Transforming caches into massively parallel vector ALUs. Every bitline pair gets a bitline ALU: the sense amps produce A AND B (on BL), ~A AND ~B (on BLB), and A XOR B; a carry latch (D flip-flop, enabled by C_EN) holds C, and the sum is S = A ^ B ^ C with carry-out Cout. Running totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

9 The passive last-level cache is transformed into nearly 1.5 million bit-serial active ALUs: Add, Multiply, and Divide at configurable precision, operating bit-serially at a GHz clock. Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

10 Why bit-serial? First consider bit-parallel arithmetic for A + B inside the array: row decoders select the operands and logic below the bitlines (BL/BLB) computes the sum.

11 Why bit-serial? In the bit-parallel layout, Word 0 through Word 3 of Array A and Array B each lie along a wordline, spanning several bitlines; the logic computes A + B.

12 Why bit-serial? Activating wordlines WL1 and WL2 reads a word of A and a word of B; the logic produces the least-significant sum bit S first.

13 Why bit-serial? The carry C from each bit position must ripple to the neighboring bitline before the next sum bit S is valid: carry propagation across bitlines.

14 Why bit-serial? The carry keeps rippling across the bitlines, yielding the remaining sum bits one by one.

15 Why bit-serial? Verdict on bit-parallel in-array arithmetic: carry propagation across bitlines means high circuit complexity and a loss of throughput and efficiency.

16 Why bit-serial? Now consider bit-serial arithmetic for A + B in the same array (row decoders, BL/BLB, logic).

17 Why bit-serial? With a transposed data layout, Word 0 through Word 3 of Array A and Array B each lie along a bitline; every bitline has its own Sum (S) and Carry latch, so all words are added in parallel, one bit position at a time.

18 Why bit-serial? Cycle 1: wordlines WL1 and WL2 activate Bit-Slice 0 of A and B; every bitline computes its sum bit S.

19 Why bit-serial? Cycle 2: Bit-Slice 1 is processed; each bitline consumes the carry C latched in cycle 1 and latches a new one.

20 Why bit-serial? Cycle 3: Bit-Slice 2 is processed the same way, carries staying local to each bitline.

21 Why bit-serial? Cycle 4 completes the 4-bit addition. Bit-serial arithmetic offers low area complexity, high throughput, and configurable, high precision; a software sketch of the scheme follows.
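To make the scheme concrete, here is a minimal software model of bit-serial addition (an illustration written for this transcription, not code from the paper): words are stored transposed so that bit-slice b of every word forms one row, each bitline keeps a private carry latch across cycles, and an N-bit add completes in N + 1 cycles (the last cycle stores the final carry).

```python
# Minimal sketch: bit-serial addition across many "bitlines" (lanes).
def bit_serial_add(A, B, n_bits):
    """Add word vectors A and B, one bit position per cycle."""
    lanes = len(A)                      # one lane per bitline
    carry = [0] * lanes                 # per-bitline carry latch (C)
    out = [0] * lanes
    for b in range(n_bits):             # cycle b processes bit-slice b (LSB first)
        a_row = [(A[i] >> b) & 1 for i in range(lanes)]  # read slice b of A
        b_row = [(B[i] >> b) & 1 for i in range(lanes)]  # read slice b of B
        for i in range(lanes):          # in hardware all lanes run in parallel
            s = a_row[i] ^ b_row[i] ^ carry[i]           # S = A ^ B ^ C
            carry[i] = (a_row[i] & b_row[i]) | (carry[i] & (a_row[i] ^ b_row[i]))
            out[i] |= s << b            # write the sum row back
    for i in range(lanes):              # cycle N+1: store the final carry
        out[i] |= carry[i] << n_bits
    return out

print(bit_serial_add([3, 5, 7, 9], [1, 2, 3, 4], 4))  # -> [4, 7, 10, 13]
```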

22 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

23 In-SRAM Arithmetic. Recap of the organization: 8 kB SRAM arrays inside each 2.5 MB LLC slice, with a bitline ALU per bitline pair (A AND B, ~A AND ~B, A XOR B from the sense amps; carry latch with C_EN; S = A ^ B ^ C). Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

24 In-SRAM Logical Operations: the required changes to the array are an additional row decoder, so two wordlines can be activated at once, and reconfigurable sense amplifiers, differential for normal reads and single-ended for compute, on bitlines BL/BLB through BLn/BLBn.

25 In-SRAM Logical Operations: with rows A and B activated together, the single-ended sense amplifier on bitline BL reads A AND B.

26 In-SRAM Logical Operations: simultaneously, the single-ended sense amplifier on BLB reads A NOR B (that is, ~A AND ~B).
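A sketch of the two-row activation trick just described, modeled with Python integers as bit-vectors (idealized behavior; the real circuit senses these values on the bitlines):

```python
# Activating two wordlines at once: BL discharges unless both cells hold 1
# (wired-AND); BLB discharges unless both cells hold 0 (wired-NOR).
def two_row_read(a, b, width=8):
    mask = (1 << width) - 1
    and_ = a & b                   # sensed single-ended on BL
    nor_ = ~(a | b) & mask         # sensed single-ended on BLB (~A & ~B)
    xor_ = ~(and_ | nor_) & mask   # derived: neither both-1 nor both-0
    return and_, nor_, xor_

print(two_row_read(0b1100, 0b1010, 4))  # -> (8, 1, 6), i.e. AND, NOR, XOR
```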

27 In-SRAM Addition: words A (bits A0, A1) and B (bits B0, B1) sit transposed on the same bitlines (256 bitlines per array), alongside Carry and Sum rows and result rows P0, P1, P2; the bitline ALU computes S = A ^ B ^ C and latches the carry via C_EN.

28 In-SRAM Addition [Cycle 1]: read A0 and B0, write the sum bit to P0, and latch the carry.

29 In-SRAM Addition [Cycle 2]: read A1 and B1, add the latched carry, and write the sum bit to P1.

30 In-SRAM Addition [Cycle 3]: write the final carry out to P2, completing the 2-bit addition in N + 1 = 3 cycles.

31 In-SRAM Multiplication: operands A (A0, A1) and B (B0, B1) share a bitline with partial-product rows P0 through P3, Carry and Sum rows, and a Tag bit that predicates writes.

32 In-SRAM Multiplication [Cycle 1]: to compute A x B for 2-bit operands (A1A0 x B1B0), the partial products A0·B0, A1·B0, A0·B1, A1·B1 must be accumulated into P0 through P3; this cycle loads the Tag from multiplier bit B0.

33 In-SRAM Multiplication [Cycle 2]: P0 <- A0·B0.

34 In-SRAM Multiplication [Cycle 3]: P1 <- A1·B0.

35 In-SRAM Multiplication [Cycle 4]: the Tag is reloaded from multiplier bit B1 to predicate the next partial product.

36 In-SRAM Multiplication [Cycle 5]: P1 <- A1·B0 + A0·B1, implemented with predication: if (B1), P <- P + A; else, P <- P.

37 In-SRAM Multiplication [Cycle 6]: P2 <- A1·B1 plus the carry from cycle 5.

38 In-SRAM Multiplication [Cycle 7]: P3 <- Cin, storing the final carry and completing the 2-bit multiply.
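The walkthrough above is predicated shift-and-add. A compact software model (illustrative only; it collapses the per-bit carry handling that the array spreads over several cycles):

```python
# Sketch: tag-predicated shift-and-add multiply across many lanes.
def bit_serial_mul(A, B, n_bits):
    lanes = len(A)
    P = [0] * lanes                     # partial-product rows P0..P(2N-1)
    for j in range(n_bits):             # one pass per multiplier bit B_j
        tag = [(B[i] >> j) & 1 for i in range(lanes)]   # Tag <- B_j
        for i in range(lanes):
            if tag[i]:                  # predicated: if (B_j) P <- P + (A << j)
                P[i] += A[i] << j       # this add itself runs bit-serially
    return P

print(bit_serial_mul([3, 7], [5, 6], 3))  # -> [15, 42]
```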

39 Supported Arithmetic. Operation | Cycles: ADD | N + 1; SUB | 2N + 1; MUL | N^2 + 5N - 2; DIV | ~1.5N^2; Comparison | 2N + 1. Overheads: 7.5% area for the synthesized array, 2% area at the processor-chip level.
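With these cycle counts and the 2.5 GHz array clock from the methodology slide, a back-of-envelope peak-throughput estimate (our arithmetic; the reconstructed ALU count is an assumption):

```python
ALUS = 1_474_560   # 5760 arrays x 256 bitlines (slide 8, reconstructed)
FREQ = 2.5e9       # array clock in Hz (methodology slide)

def peak_ops_per_sec(cycles):
    # each bit-serial ALU retires one operation every `cycles` cycles
    return ALUS * FREQ / cycles

N = 8              # Inception v3 uses 8-bit weights and inputs
print(f"peak 8-bit adds/s: {peak_ops_per_sec(N + 1):.2e}")          # ~4.1e14
print(f"peak 8-bit muls/s: {peak_ops_per_sec(N*N + 5*N - 2):.2e}")  # ~3.6e13
```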

40 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

41 Transpose: a Transpose Memory Unit (TMU), placed near the CBOX, is built from 8-T transpose bit-cells with both a row decoder and a column decoder. It supports regular read/write as well as transpose read/write, so words A0, A1, A2, ... and B0, B1, B2, ... can be streamed out as bit-slices (all MSBs together, down to all LSBs together).

42 Transpose: the TMU converts words A0-A2, B0-B2, C0-C2 between the regular layout (one word per row) and the transposed layout (one bit-slice per row) that the bitline ALUs consume.
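Functionally, the TMU performs the bit-matrix transpose sketched below (a software model; the hardware achieves it with 8-T bit-cells and dual decoders rather than loops):

```python
# Regular layout (one word per row) -> bit-serial layout
# (row b holds bit b of every word).
def transpose_words(words, n_bits):
    return [[(w >> b) & 1 for w in words] for b in range(n_bits)]

rows = transpose_words([0b1010, 0b0111, 0b0001], 4)
# rows[0] = the LSBs of A0..A2, ..., rows[3] = the MSBs,
# ready to be written one row per wordline for the bitline ALUs.
```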

43 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

44 A Convolutional Layer: M 3-D filters, each with C channels of R x S weights, slide over the input activations (C channels, H x W) to produce the output activations (M channels, E x F).

45 Mapping CNN to Neural Cache: within an 8 kB SRAM array (256 bitlines), each bitline holds a filter's R x S x 8-bit weights and the matching R x S x 8-bit input activations (gathered by unrolling over H and W), plus rows for the partial sums and the output. Each bitline performs its MACs bit-serially; a reduction across the channel dimension C then yields the output activation.

46 Mapping CNN to Neural Cache: the C = 256 channels of a filter map onto the 256 wordlines of an array (channel 1 through channel 256). Across a 2.5 MB LLC slice, bitlines are grouped into quads (Quad 1 through Quad 4), each serving a different output position, while the M = 32 filters spread across the ways (Way 1, Way 2, Way 3, ...).
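In scalar form, the dot product that each group of bitlines is responsible for looks like the sketch below (illustrative; shapes are assumptions based on the slide, and in hardware the multiplies run bit-serially in parallel, followed by a log-depth in-array reduction):

```python
# One output pixel for one filter: sum over the weight/activation pairs
# laid out along the wordline (channel) dimension.
def output_activation(weights, acts):
    prods = [w * a for w, a in zip(weights, acts)]   # parallel in-array MACs
    while len(prods) > 1:                            # log-depth reduction tree
        prods = [prods[i] + prods[i + 1] for i in range(0, len(prods), 2)]
    return prods[0]                                  # assumes a power-of-two count

print(output_activation([1, 2, 3, 4], [5, 6, 7, 8]))  # -> 70
```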

47 Mapping of Convolution to Array: the output positions (E x F) and filters (M) are spread across Ways 1-18 of every slice (Ways 19-20 are reserved) and across Slices 1 through 14.

48 Putting it together: (1) filter loading, (2) input loading, (3) MAC, (4) reduction, then output transfer. Cores 1-14 and LLC Slices 1-14 communicate with DRAM over the ring interconnect; within a slice, filter weights, input activations, and output activations occupy Ways 1-18 (Ways 19-20 reserved), organized as Quads 1 through 4. A pseudocode sketch of this flow follows.
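The per-layer flow from this slide as pseudocode, with hypothetical helper names (no such API exists in the paper; only the sequencing is meaningful):

```python
# Hypothetical sequencing sketch of one convolutional layer on Neural Cache.
def run_layer(layer, slices):
    for s in slices:
        s.load_filter_weights(layer.filters)       # (1) filter loading, once per layer
    for batch in layer.input_batches():            # inputs stream in from DRAM
        for s in slices:                           # all slices operate in parallel
            s.load_inputs(batch)                   # (2) input loading over the ring
            s.mac()                                # (3) bit-serial multiplies
            s.reduce()                             # (4) in-array reduction tree
        yield [s.read_outputs() for s in slices]   # output transfer over the ring
```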

49 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

50 Evaluation Methodology. DNN model: Inception v3 with 8-bit weights and inputs.
CPU (2 sockets): Intel Xeon E5 v3, 2.6 GHz, 28 cores, 56 threads; 70 MB on-chip (dual socket); 64 GB DRAM; TensorFlow tfprof for performance, Intel RAPL interface for energy.
GPU (1 card): Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.4 MB on-chip; 12 GB DRAM; TensorFlow tfprof for performance, NVIDIA System Management Interface for energy.
Neural Cache: 2.5 GHz compute SRAM with 1,032,192 bit-serial ALUs; 35 MB on-chip; 64 GB DRAM; cycle-accurate simulator plus C microbenchmarks for performance, SPICE simulation plus Intel RAPL interface for energy.

51 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

52 Throughput and Latency, for CPU (Xeon E5), GPU (Titan Xp), and Neural Cache. [Plots: throughput in inferences/sec versus batch size; latency in ms.] Takeaways: 2.2x throughput improvement and 7.7x latency improvement over the GPU.

53 Power/Energy Comparison. [Plots: total energy in joules and average power in watts for CPU, GPU, and Neural Cache.]

54 Neural Cache Summary: repurposes the cache as a data-parallel DNN accelerator, with massively parallel bit-serial in-SRAM arithmetic and a data layout tailored to CNNs. Large throughput, latency, and energy gains over a server-class CPU at 2% area overhead, and substantial gains over a server-class GPU (including 2.2x throughput and 7.7x latency).

55 Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das. M-Bits Research Group.

arXiv:1805.03718v1 [cs.AR], 9 May 2018. To appear in the 45th ACM/IEEE International Symposium on Computer Architecture (ISCA 2018).
