Switched by Input: Power Efficient Structure for RRAM-based Convolutional Neural Network


Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, Huazhong Yang. Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China

Outline: Background and Motivation (the large overhead of interface circuits); 1-bit Quantization to Eliminate DACs; SEI Structure to Eliminate ADCs; Experimental Results.

CNN: Application and Performance. CNN is the state of the art in visual recognition applications: pedestrian detection [arxiv2015], the Google Translate app, vehicle and lane detection [Stanford2015], and tracking [UIUC2015].

NN: Complexity. Computational complexity and energy consumption increase rapidly as networks pursue better and better recognition accuracy. Examples: AlexNet (a.k.a. CaffeNet) (2012), GoogLeNet (2015).

Energy Efficient Circuits and Systems. Energy Efficiency = Complexity / Energy = Operations/J = (OP/s)/W. How to improve the energy efficiency? [Figure: application scales labeled Micro, Macro, and Aerospace]
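The unit identity in this slide, (OP/s)/W = OP/J, can be checked with a small calculation. The sketch below is only illustrative: the function name `energy_efficiency` and the throughput and power figures are hypothetical and do not come from the slides.

```python
def energy_efficiency(ops_per_second, power_watts):
    """Return operations per joule, computed as (OP/s) / W."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return ops_per_second / power_watts


# Hypothetical accelerator: 200 GOP/s at 4 W (illustrative numbers only).
efficiency = energy_efficiency(200e9, 4.0)
print(efficiency / 1e9, "GOP/J")  # 50.0 GOP/J
```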

RRAM-based Matrix-Vector Multiplication. An RRAM crossbar merges memory and computation and performs matrix-vector multiplication directly, giving >100X efficiency gains. One RRAM cell is roughly equivalent to one m-bit multiplier + one m-bit adder + one m-bit register (SRAM) in an ASIC.
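As a rough illustration of why a crossbar computes a matrix-vector product, the sketch below models an idealized crossbar: each cell contributes a current equal to its conductance times the voltage on its row, and the currents of a column add up. The function name `crossbar_mvm` and the values are assumptions for illustration. Wire resistance, device non-linearity, and units are ignored.

```python
def crossbar_mvm(conductances, voltages):
    """Idealized crossbar model.

    conductances[j][k] is the cell connecting input row j to output column k.
    The current of column k is I_k = sum_j voltages[j] * conductances[j][k].
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for v, row in zip(voltages, conductances):
        for k, g in enumerate(row):
            currents[k] += v * g  # each cell multiplies; the column wire adds
    return currents


# Illustrative values in arbitrary units: 3 input rows, 2 output columns.
G = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
V = [1.0, 0.0, 2.0]
print(crossbar_mvm(G, V))  # [11.0, 14.0]
```

In this model every cell does one multiply and one accumulate in place, which is the intuition behind comparing one cell to a multiplier, an adder, and a register.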

RRAM-based Convolution. The function of a convolution kernel is also a vector-vector multiplication. Multiple convolution kernels share the same input data, so a set of convolution kernels can be regarded as a matrix-vector multiplication.
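The mapping from convolution to matrix-vector multiplication can be sketched as follows: every kernel is flattened into one row of a weight matrix, and every input patch is flattened into a vector shared by all kernels. This is a minimal pure-Python sketch assuming stride 1, no padding, and square kernels of equal size. The name `conv2d_as_mvm` is hypothetical.

```python
def conv2d_as_mvm(image, kernels):
    """Compute one output map per kernel using patch-wise dot products.

    image: H x W list of lists. kernels: list of k x k lists (same k).
    Stride 1, no padding (sliding-window product without kernel flipping).
    """
    k = len(kernels[0])
    h, w = len(image), len(image[0])
    # Each kernel becomes one row of the weight matrix.
    weight_matrix = [[val for row in kern for val in row] for kern in kernels]
    outputs = [[[0] * (w - k + 1) for _ in range(h - k + 1)] for _ in kernels]
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            # The flattened patch is the input vector shared by all kernels.
            patch = [image[r + i][c + j] for i in range(k) for j in range(k)]
            for idx, weights in enumerate(weight_matrix):
                outputs[idx][r][c] = sum(wt * x for wt, x in zip(weights, patch))
    return outputs


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, 1]],
           [[1, 1], [1, 1]]]
print(conv2d_as_mvm(image, kernels))
# [[[6, 8], [12, 14]], [[12, 16], [24, 28]]]
```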

Large Overhead of Interface. The intermediate data between convolutional layers need to be buffered. Buffering them as digital data introduces a large number of ADCs and DACs. More than 98% of the power is consumed by the interface!

Can we use 1-bit? A 1-bit digital signal can be used as an analog signal without a high-cost interface. Binary ANNs (fully-connected NNs) have been verified [Kim ICML 2015]. Can we use 1-bit intermediate data in CNN? Can 1-bit CNN really reduce the interface cost?


Intermediate Data Quantization. Data analysis shows that the ReLU function used in CNN leaves most of the layer output data around zero. We can therefore replace the non-linear function with a threshold operation to obtain 1-bit output data.
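A minimal sketch of the change in activation function, assuming a strict comparison against the threshold (the slides do not state whether the comparison is strict). The names `relu` and `threshold_activation` and the sample values are illustrative.

```python
def relu(x):
    """Standard ReLU: keeps positive values, clips negatives to zero."""
    return max(0.0, x)


def threshold_activation(x, threshold):
    """1-bit replacement for ReLU: output 1 if x exceeds the threshold, else 0."""
    return 1 if x > threshold else 0


# Illustrative layer outputs; most values sit near zero.
layer_outputs = [-0.4, 0.0, 0.05, 0.7]
print([relu(x) for x in layer_outputs])                       # [0.0, 0.0, 0.05, 0.7]
print([threshold_activation(x, 0.5) for x in layer_outputs])  # [0, 0, 0, 1]
```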

Threshold Optimization. Three-step greedy algorithm: (1) Weight re-scaling: re-scale the data distribution into [0,1]. (2) Layer-by-layer greedy strategy: search a different threshold for each layer to reduce the accuracy loss, using a greedy algorithm to optimize the thresholds layer by layer. (3) Threshold searching: the threshold of a specific layer is searched by a brute-force method, using the 60,000 samples in the training set to avoid over-fitting.
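The layer-by-layer search can be sketched as below. This is not the authors' implementation. It assumes the data have already been re-scaled into [0,1], that candidate thresholds come from a fixed grid, and that a caller-supplied `evaluate` function returns training-set accuracy for a given list of thresholds (with `None` meaning the layer is not yet quantized). The `toy_accuracy` stand-in exists only to make the example runnable.

```python
def greedy_threshold_search(num_layers, candidates, evaluate):
    """Pick one threshold per layer, front to back, freezing each choice."""
    thresholds = [None] * num_layers  # None = layer not quantized yet
    for layer in range(num_layers):
        best_t, best_acc = None, -1.0
        for t in candidates:  # brute-force search for this layer only
            trial = thresholds[:layer] + [t] + thresholds[layer + 1:]
            acc = evaluate(trial)
            if acc > best_acc:
                best_t, best_acc = t, acc
        thresholds[layer] = best_t  # freeze, then move to the next layer
    return thresholds


def toy_accuracy(thresholds):
    """Hypothetical stand-in for a training-set evaluation of the network."""
    targets = [0.3, 0.1]
    penalty = sum(abs(t - best) for t, best in zip(thresholds, targets)
                  if t is not None)
    return 1.0 - penalty


candidates = [i / 10 for i in range(6)]  # grid 0.0, 0.1, ..., 0.5
print(greedy_threshold_search(2, candidates, toy_accuracy))  # [0.3, 0.1]
```

The greedy order keeps the cost linear in the number of layers (layers times candidates evaluations) instead of searching all threshold combinations jointly.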

Quantization Result. The quantization method is tested on three CNNs on MNIST. The accuracy loss is less than 1%.


Can we use 1-bit? A 1-bit digital signal can be used as an analog signal without a high-cost interface. Binary ANNs (fully-connected NNs) have been verified [Kim ICML 2015]. Can we use 1-bit intermediate data in CNN? Answer: greedy 1-bit quantization. Can 1-bit CNN really reduce the interface cost?


1-bit Quantization Only Eliminates DACs. RRAM devices cannot support high-precision and signed weights. Two crossbars are needed to store the positive and negative weights, and two or more crossbars are needed when using low-precision devices. Example: the positive crossbar outputs 7 and the negative crossbar outputs 5, so the original output is 7 - 5 = 2 > threshold. The DACs are no longer needed, but the ADCs still are.
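The two-crossbar arrangement can be illustrated with a small numeric sketch. The weights, the threshold, and the name `two_crossbar_output` are hypothetical, chosen so that the column sums reproduce the 7 and 5 of the slide. The subtraction is written as plain arithmetic here; it stands for the merge step that, according to the slide, still requires ADCs.

```python
def two_crossbar_output(pos_weights, neg_weights, bits, threshold):
    """Merge a positive and a negative crossbar column, then threshold."""
    pos_sum = sum(w * x for w, x in zip(pos_weights, bits))  # positive crossbar
    neg_sum = sum(w * x for w, x in zip(neg_weights, bits))  # negative crossbar
    merged = pos_sum - neg_sum  # merging the two results is the costly step
    return merged, 1 if merged > threshold else 0


bits = [1, 1, 0, 1]            # 1-bit inputs: no DAC needed to drive rows
pos_weights = [3, 4, 9, 0]     # column sum for these inputs: 7
neg_weights = [2, 0, 1, 3]     # column sum for these inputs: 5
print(two_crossbar_output(pos_weights, neg_weights, bits, threshold=1))  # (2, 1)
```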

SEI: Selected by Input. The original function of a layer is [equation not preserved]. With 1-bit input, the function changes into [equation not preserved]. We can regard the original input signal as a selection signal and obtain an extra input port for the RRAM crossbar. [Figure: original interface vs. SEI interface]
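The equations of this slide did not survive, so the following is only a sketch of the underlying observation, under the assumption that the layer computes a weighted sum of its inputs: when every input is 0 or 1, multiplying by the input is the same as selecting which weights take part in the sum. The function names are illustrative.

```python
def weighted_sum(weights, inputs):
    """General form: every weight is multiplied by its input."""
    return sum(w * x for w, x in zip(weights, inputs))


def selected_sum(weights, bits):
    """1-bit form: the input only selects which weights are accumulated."""
    return sum(w for w, b in zip(weights, bits) if b == 1)


weights = [0.5, -1.25, 2.0, 0.75]
bits = [1, 0, 1, 1]
print(weighted_sum(weights, bits), selected_sum(weights, bits))  # 3.25 3.25
```

Because the input no longer has to carry a value, the row's driving port is free to carry something else, which is the extra input port mentioned in the slide.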

Overall Structure. The extra port can provide information common to all weights in a row, such as precision and sign. SEI can therefore use a single RRAM crossbar to process computations with signed, high-precision weights.
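One way to picture the single-crossbar idea is sketched below. The layout is an assumption made for illustration, not a description of the authors' circuit: each signed weight is split into a high and a low 4-bit part stored in separate rows, the extra port of each row carries the shared sign and bit significance, and the 1-bit input selects whether the row contributes. The names `split_signed_weight` and `sei_column_output` are hypothetical.

```python
def split_signed_weight(weight, input_index, cell_bits=4):
    """Split a signed weight into two rows of (port_value, cell_value, input_index).

    Assumes abs(weight) < 2 ** (2 * cell_bits) so each part fits in one cell.
    """
    base = 1 << cell_bits
    sign = 1 if weight >= 0 else -1
    high, low = divmod(abs(weight), base)
    # The port value carries what is common to the row: sign and significance.
    return [(sign * base, high, input_index), (sign, low, input_index)]


def sei_column_output(rows, select_bits):
    """Sum port_value * cell_value over the rows selected by the 1-bit inputs."""
    total = 0
    for port_value, cell_value, input_index in rows:
        if select_bits[input_index] == 1:  # input acts as a selection signal
            total += port_value * cell_value
    return total


weights = [37, -18, 5]
rows = [row for j, w in enumerate(weights) for row in split_signed_weight(w, j)]
print(sei_column_output(rows, [1, 1, 0]))  # 19, i.e. 37 - 18
```

In this sketch the signed, higher-precision sum comes out of one column, so no separate merge of positive and negative crossbar results is needed.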


Result. Eliminating the interface circuits greatly reduces energy and area. The accuracy reduction is less than 1%.

Summary. Can we use 1-bit intermediate data in CNN? 1-bit quantization: a greedy quantization method converts the intermediate data into 1 bit and eliminates the DACs. Can 1-bit CNN really reduce the interface cost? SEI: using the input data as selection signals reduces the ADC cost of merging the results of multiple crossbars.

References
[GoogLeNet] Szegedy C, Liu W, Jia Y, et al. "Going deeper with convolutions," arXiv preprint.
[AlexNet] Krizhevsky A, Sutskever I, Hinton G E. "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 2012.
[FPGA16] Jiantao Qiu et al. "Going deeper with embedded FPGA platform for convolutional neural network," to appear in FPGA.
[Chen ISSCC 2016] Y. H. Chen, T. Krishna, J. Emer and V. Sze. "14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2016.
[Gokhale 2014] V. Gokhale, J. Jin, A. Dundar, B. Martini and E. Culurciello. "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, 2014.
[Zhang 2015] Zhang C, Li P, Sun G, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks," Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2015.
[Chakradhar 2010] Chakradhar S, Sankaradas M, Jakkula V, et al. "A dynamically configurable coprocessor for convolutional neural networks," ACM SIGARCH Computer Architecture News, ACM, 2010, 38(3).
[Kim ICML 2015] M. Kim et al. "Bitwise neural networks," ICML workshop, 2015.
