Binary Convolutional Neural Network on RRAM


Binary Convolutional Neural Network on RRAM
Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang
Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China
e-mail: yu-wang@mail.tsinghua.edu.cn

Outline
- Background & Motivation
- RRAM-based Binary CNN Accelerator Design: System Overview, Convolver Circuit, Line Buffer & Pipeline
- Experimental Results: Comparison between BCNN on RRAM and Multi-bit CNN on RRAM, Recognition Accuracy under Device Variation, Area and Energy Cost Saving
- Conclusion

Convolutional Neural Networks: popular over the past ten years, with good performance for a wide range of applications.
- Generalized recognition (CNN): object detection & localization, image captioning
- Specialized recognition (CNN): lane detection & vehicle detection, pedestrian detection, face recognition
- Beyond vision tasks (CNN + other): speech recognition (CNN + RNN), natural language processing (CNN + LSTM), chess & Go (CNN + reinforcement learning)

Winners of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), task: classification & localization with provided data:

Year | Team | Model | Err (Top-5)
2011 | XRCE | Not CNN | 25.8%
2012 | SuperVision | AlexNet | 16.4%
2013 | Clarifai | ZF | 11.7%
2014 | VGG | VGG-16 | 7.4%
2015 | MSRA | ResNet-152 | 3.57%
2016 | Trimps-Soushen | Ensemble | 2.99%

For the current popular CNN models (the ResNet series): <5% top-5 error rate, 10-50M weights for a single model, and >10G operations per inference [arXiv:1605.07678]. A high-energy-efficiency platform is required for CNN applications!

LeNet-5 as an example: the layer-wise structure of a convolutional neural network and its basic operations. CNN operations: convolution, pooling, and normalization, plus the fully-connected and neuron operations shared with DNNs.

LeNet-5, layer-wise structure & basic operations: CNN on RRAM? The operations to map are fully-connected, neuron, convolution, pooling, and normalization; DNN on RRAM has already been explored (PRIME [ISCA 2016], ISAAC [ISCA 2016]).

Layer-wise structure: Input Image → Conv Layer 1 → Pooling Layer 1 → ... → Conv Layer N → Pooling Layer N → FC Layer (N+1) → ... → FC Layer (N+M) → Recognition Result.
Convolution vs. fully-connected: both reduce to the matrix-vector multiplication y = Wx, handled by the convolver circuit (weight mapping and interface design); convolution additionally requires a sliding window with data buffering & fetching.

For the same layer-wise structure:
- Normalization vs. neuron: a one-to-one mapping, realized in the peripheral circuits as a linear or non-linear function.
- Pooling: a sliding window with data buffering & fetching, as in convolution.
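As a minimal illustration of this equivalence (a numpy sketch written for this transcript, not the authors' code), a convolution can be lowered to the same matrix-vector product y = Wx that a fully-connected layer uses by flattening each sliding-window position into a vector:

```python
# Minimal numpy sketch (not from the paper): a convolution lowered to the same
# matrix-vector product y = W x that a fully-connected layer uses, by flattening
# each sliding-window position into a vector. Names are illustrative.
import numpy as np

def conv_as_matvec(fmap, kernels):
    """fmap: (C_in, H, W); kernels: (C_out, C_in, k, k) -> (C_out, H-k+1, W-k+1)."""
    c_in, H, W = fmap.shape
    c_out, _, k, _ = kernels.shape
    Wmat = kernels.reshape(c_out, c_in * k * k)        # weight matrix held on the crossbar
    out = np.zeros((c_out, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            x = fmap[:, i:i + k, j:j + k].reshape(-1)  # one sliding-window position
            out[:, i, j] = Wmat @ x                    # y = W x, same as an FC layer
    return out

fmap = np.random.randn(3, 8, 8)
kernels = np.random.randn(16, 3, 3, 3)
print(conv_as_matvec(fmap, kernels).shape)             # (16, 6, 6)
```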

BCNN on RRAM. Two main concerns for CNN-on-RRAM design: convolver circuit design, and intermediate result buffering & fetching.

BCNN on RRAM. For the convolver circuit design, note that reported throughput is in GOPS, not GFLOPS: quantization is always introduced into y = Wx, both weight quantization and interface quantization. The second concern remains intermediate result buffering & fetching.

BCNN on RRAM. Two main concerns for CNN-on-RRAM design:
- Convolver circuit design. Weight mapping: a lower bit-level of weights means a lower requirement on RRAM representative ability. Interface design: a lower bit-level of neurons means a lower cost for the ADC/DAC interfaces. Extreme case: binary weights & binary interfaces, i.e., Binary CNN on RRAM.
- Intermediate result buffering & fetching: line buffer structure & pipelining strategy.

BCNN on RRAM. Binary CNN training workflow: binarize while training (BinaryNet [arXiv:1602.02830], XNOR-Net [Rastegari, ECCV 2016]). Inference: how to map the well-trained BCNN onto RRAM? The layer computation is y_b = Binarize(W_b x_b).
- When the RRAM crossbar size > the weight matrix size: direct mapping.
- When the RRAM crossbar size < the weight matrix size (e.g., crossbar column length = 128, while a VGG-19 conv kernel spans 3*3*512 = 4608 inputs): matrix splitting.
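A small numpy sketch of what one binary layer computes at inference time (my reading of y_b = Binarize(W_b x_b), not the authors' code); it also shows why the VGG-19 example above forces matrix splitting on a 128-row crossbar:

```python
# Minimal numpy sketch (my reading of the slide, not the authors' code) of one
# binary layer at inference time: weights and activations are {-1, +1} and the
# output is re-binarized by sign, i.e. y_b = Binarize(W_b x_b).
import numpy as np

def binarize(t):
    return np.where(t >= 0, 1, -1)

def binary_layer(Wb, xb):
    return binarize(Wb @ xb)                  # y_b = Binarize(W_b x_b)

rng = np.random.default_rng(0)
Wb = binarize(rng.standard_normal((128, 4608)))   # rows sized like a 3*3*512 VGG-19 kernel
xb = binarize(rng.standard_normal(4608))
yb = binary_layer(Wb, xb)

# A 4608-long dot product cannot fit in a 128-row crossbar column, which is
# exactly the matrix-splitting case above: 4608 / 128 = 36 crossbars per column.
print(yb.shape, 4608 // 128)                  # (128,) 36
```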

BCNN on RRAM: Convolver Circuit. How to map a large matrix onto a group of RRAM crossbars?
- Column splitting (when the number of RRAM columns < the number of kernels): split the weight matrix by columns across crossbars.
- Row splitting (when the RRAM column length < the kernel size): split the matrix by rows and sum the resulting partial sums.
- Signal splitting: output = Pos - Neg, i.e., signed weights are realized as the difference between a positive and a negative crossbar.
How to decide the bit-level of the partial sum? Current solution: a 4-bit interface, found by brute-force search to trade off accuracy against energy efficiency. Future work: merge the partial-sum quantization into the training process.
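The splitting can be sketched in a few lines of numpy (assumed semantics: column splitting simply partitions the output columns across crossbars, so the sketch below shows only row splitting with the 4-bit partial-sum interface and the Pos - Neg signal splitting; the quantizer is illustrative, not the paper's exact circuit):

```python
# Numpy sketch of the splitting ideas (assumed semantics, illustrative quantizer;
# column splitting simply partitions the output columns, so only row splitting
# and Pos - Neg signal splitting are shown here).
import numpy as np

XBAR_ROWS = 128                 # RRAM column length

def quantize(v, bits=4):
    """Uniformly quantize a partial sum to a signed `bits`-bit interface."""
    lo, hi = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(v)), 1e-12) / hi
    return np.clip(np.round(v / scale), lo, hi) * scale

def split_matvec(W, x):
    """Row splitting: W @ x as a sum of per-crossbar partial sums, each passed
    through the 4-bit partial-sum interface; signed weights are realized as a
    positive crossbar minus a negative one (signal splitting)."""
    Wpos, Wneg = np.maximum(W, 0), np.maximum(-W, 0)   # W = Pos - Neg
    y = np.zeros(W.shape[0])
    for r in range(0, W.shape[1], XBAR_ROWS):          # row splitting
        xs = x[r:r + XBAR_ROWS]
        part = Wpos[:, r:r + XBAR_ROWS] @ xs - Wneg[:, r:r + XBAR_ROWS] @ xs
        y += quantize(part, bits=4)                     # 4-bit partial-sum interface
    return y

rng = np.random.default_rng(1)
W = rng.choice([-1.0, 1.0], size=(64, 4608))
x = rng.choice([-1.0, 1.0], size=4608)
print(np.mean(np.sign(split_matvec(W, x)) == np.sign(W @ x)))   # sign agreement after quantization
```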

BCNN on RRAM: Line Buffer & Pipeline. With a sliding window it is unnecessary to buffer the whole input feature map: the convolver circuit can awake (A) from sleep (S) once the input data for one conv kernel has arrived. A line buffer structure is introduced to cache and fetch the intermediate feature maps: the feature map from the previous convolver circuit streams into C_in line buffers (each one feature-map row, width w), and a CONV_EN signal plus MUX control select the current window and pass the output feature map on to the next convolver circuit.
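A toy single-channel Python sketch of the line-buffer idea (my own illustration; the slide's structure uses C_in line buffers, one per input channel): only k rows are buffered, and the convolver stays asleep until a full k*k window is available:

```python
# Toy single-channel Python sketch of the line-buffer idea (my own illustration;
# the slide's structure uses C_in line buffers, one per input channel): only k
# rows are kept, and the convolver sleeps until a full k*k window is available.
from collections import deque
import numpy as np

def stream_convolve(rows, kernel):
    """rows: iterator over 1-D feature-map rows; kernel: (k, k). Yields output rows."""
    k = kernel.shape[0]
    line_buffer = deque(maxlen=k)               # holds only k rows, never the whole map
    for row in rows:
        line_buffer.append(row)
        if len(line_buffer) < k:                # sleep (S): window not yet filled
            continue
        window_rows = np.stack(list(line_buffer))   # awake (A): slide along this row
        yield np.array([np.sum(window_rows[:, j:j + k] * kernel)
                        for j in range(row.size - k + 1)])

fmap = np.random.randn(8, 8)
kernel = np.random.randn(3, 3)
out = np.stack(list(stream_convolve(iter(fmap), kernel)))
print(out.shape)   # (6, 6): the whole 8x8 map is never buffered at once
```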

BCNN on RRAM: Line Buffer & Pipeline, basic convolution (a 3*3 kernel sliding over the feature map of layer i):
- The line buffer size is much smaller than the feature map size.
- Zero padding keeps the input and output feature map sizes the same.
- The convolver is awake while sliding along a row and sleeps when it meets the end of one row.

Experimental Results. Experimental setup:
- Small case: LeNet on MNIST (C5*20-S2-C5*50-S2-FC100-FC10); matrix splitting is unnecessary. Used to study the effect of device variation when mapping the multi-bit/binary CNN model onto N-bit RRAM devices.
- Large case: AlexNet on ImageNet; matrix splitting is necessary (4-bit ADC for the partial-sum interface). Used for area and energy estimation of the multi-bit/binary CNN model mapped onto an N-bit RRAM platform.
- Other settings: crossbar size 128x128.
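For reference, one plausible reading of the "C5*20-S2-C5*50-S2-FC100-FC10" notation as a PyTorch module (C = 5x5 convolution, S = 2x2 subsampling; this is my reconstruction with activations omitted, not the authors' training code):

```python
# One plausible reading of "C5*20-S2-C5*50-S2-FC100-FC10" as a PyTorch module
# (C = 5x5 convolution, S = 2x2 subsampling); my reconstruction with activations
# omitted, not the authors' training code.
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5),   # C5*20: 20 maps of 5x5 kernels, 28x28 -> 24x24
    nn.MaxPool2d(2),                   # S2: 24x24 -> 12x12
    nn.Conv2d(20, 50, kernel_size=5),  # C5*50: 12x12 -> 8x8
    nn.MaxPool2d(2),                   # S2: 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(50 * 4 * 4, 100),        # FC100
    nn.Linear(100, 10),                # FC10: scores for the 10 MNIST digits
)

print(lenet(torch.randn(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```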

Experimental Results. Effects of device variation under different bit-levels.
CNN model:
- Multi-bit model (quantization error): the well-trained floating-point model is dynamically quantized into M bits [Qiu, FPGA 2016].
- Binary model (training error): trained with the BinaryNet training workflow.
RRAM model (mapping error): devices with N-bit representative ability. Assumption: symmetric conductance ranges, where the k-th conductance range is (g(k) - Δg, g(k) + Δg), i.e., device variation lies within (-Δg, +Δg).
- Full bit-level mode: all 2^N conductance ranges are in use.
- Binary mode: only two conductance ranges are picked from the 2^N.
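A numpy sketch of why binary mode tolerates variation better under this model (my interpretation: 2^N evenly spaced conductance levels g(k), each read back with a uniform perturbation; the level spacing and Δg values are illustrative, not measured device data):

```python
# Numpy sketch of why binary mode tolerates variation better under this model
# (my interpretation: 2^N evenly spaced conductance levels g(k), each read back
# with a uniform perturbation in (-dg, +dg); the spacing and dg are illustrative).
import numpy as np

def conductance_levels(n_bits, g_min=1.0, g_max=16.0):
    return np.linspace(g_min, g_max, 2 ** n_bits)

def program_and_read(level_idx, levels, dg, rng):
    """Write level g(k), read back g(k) + uniform(-dg, +dg) device variation."""
    return levels[level_idx] + rng.uniform(-dg, dg, size=level_idx.shape)

rng = np.random.default_rng(2)
levels = conductance_levels(n_bits=4)
dg = 0.6 * (levels[1] - levels[0])      # variation comparable to the level spacing

# Full bit-level mode: all 2^N levels are in use, so neighbouring decision
# boundaries are close and the perturbation can flip a stored value.
full_idx = rng.integers(0, len(levels), size=10000)
read = program_and_read(full_idx, levels, dg, rng)
decoded = np.abs(read[:, None] - levels[None, :]).argmin(axis=1)
print("full bit-level read-error rate:", np.mean(decoded != full_idx))

# Binary mode: only the two extreme levels are used, so the single decision
# boundary sits far from both levels and the same dg almost never crosses it.
bin_idx = rng.choice([0, len(levels) - 1], size=10000)
read = program_and_read(bin_idx, levels, dg, rng)
decoded = np.where(np.abs(read - levels[0]) < np.abs(read - levels[-1]), 0, len(levels) - 1)
print("binary read-error rate:", np.mean(decoded != bin_idx))
```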

Experimental Results. Error rate of LeNet on MNIST (multi-bit/binary CNN): effects of device variation under different bit-levels.
- No variation (baseline: precise mapping from the quantized model to RRAM): full bit-level mode suffers only the dynamic quantization error; binary mode suffers only the training error.
- With variation: full bit-level mode suffers dynamic quantization error + variation; binary mode suffers training error + variation.
Binary mode shows better robustness than full bit-level mode.

Experimental Results. Area and energy estimation of different RRAM-based crossbar PEs:
- Area: the FC layers take the largest part.
- Energy: the conv layers take the largest part.
- Input interface: mostly saved.
- Output interface: still takes a large portion.

Conclusion. In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process, addressing the matrix splitting problem and the pipeline implementation. The robustness of BCNN on RRAM under device variation is demonstrated. Experimental results show that BCNN introduces negligible recognition accuracy loss for LeNet on MNIST. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% of the energy consumption and 56.8% of the area compared with the multi-bit CNN structure.

Thanks for your attention.