Binary Convolutional Neural Network on RRAM
Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang
Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
e-mail: yu-wang@mail.tsinghua.edu.cn
Outline
- Background & Motivation
- RRAM-based Binary CNN Accelerator Design
  - System Overview
  - Convolver Circuit
  - Line Buffer & Pipeline
- Experimental Results
  - Comparison between BCNN on RRAM and Multi-bit CNN on RRAM
  - Recognition Accuracy under Device Variation
  - Area and Energy Cost Saving
- Conclusion
Convolutional Neural Network
Popular over the recent decade, with good performance across a wide range of applications:
- Generalized recognition (CNN): object detection & localization, image captioning
- Specialized recognition (CNN): lane detection & vehicle detection, pedestrian detection, face recognition
- Beyond vision tasks (CNN + other): speech recognition (CNN + RNN), natural language processing (CNN + LSTM), chess & Go (CNN + reinforcement learning)
Convolutional Neural Network
Popular over the recent decade, with good performance across a wide range of applications.
Winners of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), task: Classification & Localization with Provided Data:

Year | Team           | Model      | Err (Top-5)
2011 | XRCE           | Not CNN    | 25.8%
2012 | SuperVision    | AlexNet    | 16.4%
2013 | Clarifai       | ZF         | 11.7%
2014 | VGG            | VGG-16     | 7.4%
2015 | MSRA           | ResNet-152 | 3.57%
2016 | Trimps-Soushen | Ensemble   | 2.99%

For the currently popular CNN models (ResNet series): <5% top-5 error rate, 10-50M weights for a single model, >10G operations per inference [arXiv:1605.07678].
A high-energy-efficiency platform is required for CNN applications!
LeNet-5
Convolutional Neural Network: Layer-wise Structure & Basic Operations
- CNN operations: Convolution, Pooling, Normalization
- DNN operations: Fully-Connected, Neuron
LeNet-5
Convolutional Neural Network: Layer-wise Structure & Basic Operations
- DNN on RRAM (Fully-Connected, Neuron): addressed by prior work (PRIME [ISCA 2016], ISAAC [ISCA 2016])
- CNN on RRAM (Convolution, Pooling, Normalization)?
Convolutional Neural Network: Layer-wise Structure & Basic Operations
[Figure: layer-wise pipeline from Input Image through Conv Layer 1, Pooling Layer 1, ..., Conv Layer N, Pooling Layer N, FC Layer (N+1), ..., FC Layer (N+M), to Recognition Result]
- Convolution vs. Fully-Connected: matrix-vector multiplication y = Wx
  - Convolver circuit: weight mapping, interface design
  - Sliding window: data buffering & fetching
- Pooling: one-to-one mapping
Convolutional Neural Network: Layer-wise Structure & Basic Operations
[Figure: the same layer-wise pipeline from Input Image to Recognition Result]
- Convolution vs. Fully-Connected: matrix-vector multiplication, handled by the convolver circuit with a sliding window for data buffering & fetching (see the sketch below)
- Normalization vs. Neuron: one-to-one mapping, handled by a peripheral linear or non-linear function circuit
- Pooling: sliding window for data buffering & fetching
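The reduction of convolution to y = Wx is what lets a crossbar serve both the conv and fully-connected layers. A minimal NumPy sketch of this flattening (shapes and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def conv_as_matvec(x, W, k):
    """Compute a conv layer as repeated y = W x: each k x k x C_in
    sliding window is flattened into a vector and multiplied by the
    kernel matrix, the same operation a crossbar performs per window.
    x: input feature map, shape (H, W_in, C_in)
    W: kernel matrix, shape (C_out, k*k*C_in), one flattened kernel per row
    Returns an output feature map of shape (H-k+1, W_in-k+1, C_out)."""
    H, Wi, C = x.shape
    out = np.empty((H - k + 1, Wi - k + 1, W.shape[0]))
    for i in range(H - k + 1):
        for j in range(Wi - k + 1):
            window = x[i:i + k, j:j + k, :].reshape(-1)  # sliding window
            out[i, j, :] = W @ window                    # y = W x
    return out
```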
BCNN on RRAM
Two main concerns for CNN-on-RRAM design:
- Convolver circuit design
- Intermediate result buffering & fetching
BCNN on RRAM
Two main concerns for CNN-on-RRAM design:
- Convolver circuit design: y = Wx. Note that accelerator throughput is reported in GOPS, not GFLOPS: quantization is always introduced, for both the weights (weight quantization) and the interfaces (interface quantization).
- Intermediate result buffering & fetching
BCNN on RRAM
Two main concerns for CNN-on-RRAM design:
- Convolver circuit design
  - Weight mapping: a lower bit-level of weights lowers the requirement on RRAM's representation ability
  - Interface design: a lower bit-level of neurons lowers the cost of the ADC/DAC interfaces
  - Extreme case: binary weights & binary interfaces, i.e., Binary CNN on RRAM
- Intermediate result buffering & fetching: line buffer structure & pipelining strategy
BCNN on RRAM
- Binary CNN training workflow: binarize while training, as in BinaryNet [arXiv:1602.02830] and XNOR-Net [Rastegari, ECCV 2016]
- Inference: how to map the well-trained BCNN onto RRAM? Compute y^b = Binarize(W^b x^b), as sketched below.
  - RRAM crossbar size >= weight matrix size: direct mapping
  - RRAM crossbar size < weight matrix size: matrix splitting (e.g., crossbar column length = 128, while a VGG-19 conv kernel spans 3*3*512 = 4608 inputs)
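A minimal sketch of the inference equation above, assuming sign binarization and a weight matrix small enough for direct mapping (sizes and names are illustrative, not from the paper):

```python
import numpy as np

def binarize(v):
    """Sign binarization used at inference: outputs are +1 / -1."""
    return np.where(v >= 0, 1, -1)

rng = np.random.default_rng(0)
# A 16x9 binary weight matrix fits in a single 128x128 crossbar,
# so direct mapping applies and no splitting is needed.
W_b = binarize(rng.standard_normal((16, 9)))  # binary weights W^b
x_b = binarize(rng.standard_normal(9))        # binary input vector x^b
y_b = binarize(W_b @ x_b)                     # y^b = Binarize(W^b x^b)
```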
BCNN on RRAM: Convolver Circuit
How to map a large matrix onto a group of RRAM crossbars?
- Column splitting (# RRAM columns < # kernels): the kernel matrix is split column-wise across crossbars
- Row splitting (RRAM column length < kernel size): the kernel matrix is split row-wise and the partial sums are accumulated
- Signal splitting: y = Pos - Neg, i.e., positive and negative weights are mapped onto two crossbars and their outputs subtracted
How to decide the bit-level of the partial sum?
- Current solution: a 4-bit partial-sum interface (sketched below), chosen by brute-force search to trade off accuracy against energy efficiency
- Future work: merge the partial-sum quantization into the training process
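A sketch of row and column splitting with 4-bit partial sums, assuming a 128x128 crossbar and a simple uniform quantizer (the quantizer and all names are illustrative assumptions; the paper does not specify the interface circuit at this level):

```python
import numpy as np

XBAR = 128  # assumed crossbar size, matching the 128x128 setting

def quantize(v, bits=4):
    """Illustrative uniform quantizer for the partial-sum interface."""
    scale = np.max(np.abs(v))
    if scale == 0:
        return v
    step = scale / (2 ** (bits - 1))
    return np.round(v / step) * step

def split_matvec(W, x, bits=4):
    """Compute y = W x tile by tile on 128x128 crossbars.
    Column splitting: output rows beyond 128 go to further crossbars.
    Row splitting: inputs beyond 128 go to further crossbars whose
    4-bit-quantized partial sums are accumulated digitally.
    Signal splitting (y = Pos - Neg) is omitted for brevity."""
    C_out, C_in = W.shape
    y = np.zeros(C_out)
    for r0 in range(0, C_out, XBAR):          # column splitting
        for c0 in range(0, C_in, XBAR):       # row splitting
            tile = W[r0:r0 + XBAR, c0:c0 + XBAR]
            partial = tile @ x[c0:c0 + XBAR]  # one crossbar's output
            y[r0:r0 + XBAR] += quantize(partial, bits)
    return y
```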
BCNN on RRAM: Line Buffer & Pipeline
- Sliding window: it is unnecessary to buffer the whole input feature map; the convolver circuit can awake (A) from sleep (S) as soon as enough input data for one conv kernel has arrived (modeled in the sketch below)
- Structure of the line buffer: C_in line buffers are introduced to cache and fetch the intermediate feature map
[Figure: C_in line buffers of width w and height h between the previous and the next convolver circuit, with IN, CONV_EN, MUX CTRL, and OUT signals]
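A behavioral sketch of the sleep/awake behavior with a k-row line buffer, in NumPy (a software model only; buffer depth and names are assumptions, not the hardware design):

```python
from collections import deque
import numpy as np

def line_buffer_conv(rows, w, k=3):
    """Stream a feature map row by row through a k-row line buffer and
    fire the convolver for every k x k window, without ever buffering
    the whole feature map.
    rows: iterable of 1-D arrays (one feature-map row at a time)
    w:    flattened k*k conv kernel
    Yields one output row per fully buffered window position."""
    buf = deque(maxlen=k)          # only k rows are ever resident
    for row in rows:
        buf.append(row)
        if len(buf) < k:
            continue               # convolver sleeps (S): no full window yet
        block = np.stack(buf)      # the k most recent rows
        yield np.array([w @ block[:, j:j + k].reshape(-1)
                        for j in range(block.shape[1] - k + 1)])
        # convolver is awake (A) while sliding along the row

# Usage: sum-filter a 5x5 map with a 3x3 all-ones kernel.
fmap = np.arange(25.0).reshape(5, 5)
out_rows = list(line_buffer_conv(fmap, np.ones(9)))
```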
BCNN on RRAM: Line Buffer & Pipeline
- Basic convolution: line buffer size << feature map size
[Figure: a 3*3 kernel sliding over the feature maps of layer i]
- Zero padding: keeps the input and output feature map sizes the same
- Pipeline control: the convolver is awake while sliding along a row and sleeps when it meets the end of the row
Experimental Results
Experimental setup:
- Small case: LeNet on MNIST (C5*20-S2-C5*50-S2-FC100-FC10)
  - Matrix splitting is unnecessary
  - Evaluates the effect of device variation on multi-bit/binary CNN models mapped onto N-bit RRAM devices
- Large case: AlexNet on ImageNet
  - Matrix splitting is necessary (4-bit ADC for the partial-sum interface)
  - Evaluates area and energy of multi-bit/binary CNN models mapped onto an N-bit RRAM platform
- Other settings: crossbar size 128x128
Experimental Results
Effects of device variation under different bit-levels:
- CNN model
  - Multi-bit model (quantization error): the well-trained floating-point model is dynamically quantized [Qiu, FPGA 2016] into M bits
  - Binary model (training error): a model well trained under the BinaryNet training workflow
- RRAM model with N-bit representation ability (mapping error)
  - Assumption: symmetric conductance ranges; the k-th conductance range is (g(k) - Δg, g(k) + Δg), i.e., the device variation lies in (-Δg, +Δg)
  - Full bit-level mode: all 2^N conductance ranges are in use
  - Binary mode: only two conductance ranges are picked from the 2^N (modeled in the sketch below)
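A sketch of this variation model, assuming 2^N uniformly spaced nominal conductances and uniform variation within (-Δg, +Δg) (the level spacing and variation width are illustrative assumptions):

```python
import numpy as np

def program_cells(level_idx, n_bits, var_ratio, rng):
    """Map quantized weight level indices onto RRAM conductances.
    The 2^N nominal conductances g(k) are uniformly spaced; each
    programmed cell lands uniformly within (g(k) - dg, g(k) + dg),
    where dg = var_ratio * (half the level spacing)."""
    g = np.linspace(-1.0, 1.0, 2 ** n_bits)   # nominal g(k)
    dg = var_ratio * (g[1] - g[0]) / 2        # device variation bound
    nominal = g[level_idx]
    return nominal + rng.uniform(-dg, dg, size=nominal.shape)

rng = np.random.default_rng(0)
n = 4  # a 4-bit device: 16 conductance ranges
# Full bit-level mode: all 2^N levels are in use.
full = program_cells(rng.integers(0, 2 ** n, 1000), n, var_ratio=0.9, rng=rng)
# Binary mode on the same device: only the two extreme levels are
# picked, so the gap between used levels is (2^N - 1) steps wide,
# which is why binary mode tolerates the same Δg far better.
binary = program_cells(rng.integers(0, 2, 1000) * (2 ** n - 1), n,
                       var_ratio=0.9, rng=rng)
```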
Experimental Results
Multi-bit/binary CNN error rate of LeNet on MNIST: effects of device variation under different bit-levels
- No variation (baseline): precise mapping from the quantized model to RRAM
  - Full bit-level mode: dynamic quantization error only
  - Binary mode: training error only
- With variation:
  - Full bit-level mode: dynamic quantization error + variation
  - Binary mode: training error + variation
- Binary mode shows better robustness than full bit-level mode
Experimental Results
Area and energy estimation of different RRAM-based crossbar PEs:
- Area: the FC layers take the largest part
- Energy: the Conv layers take the largest part
- Input interface: mostly saved
- Output interface: still takes a large portion
Conclusion
In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process:
- The matrix splitting problem is addressed.
- The pipeline implementation is presented.
- The robustness of BCNN on RRAM under device variation is demonstrated.
Experimental results show that BCNN introduces negligible recognition accuracy loss for LeNet on MNIST. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% of the energy consumption and 56.8% of the area compared with the multi-bit CNN structure.
Thanks for your Attention